Methods and systems for image or audio recognition processing

ABSTRACT

Many of the detailed technologies are useful in enabling a smart phone to respond to a user's environment, e.g., so it can serve as an intuitive hearing and seeing device. A few of the detailed arrangements involve optimizing division of shared processing tasks between the phone and remote devices; using a phone GPU for exhaustive speculative execution and machine vision purposes (including facial recognition); novel device architectures involving abstraction layers that facilitate substitution of different local and remote services; interactions with private networks as they relate to audio/image processing; adapting the orders in which operations are executed, and the types of data that are exchanged with remote servers, in accordance with current context; reconfiguring networks based on sensed social affiliations among users and in accordance with predictive models of user behavior; etc. A great variety of other features and arrangements are also detailed.

RELATED APPLICATION DATA

This application is a continuation of application Ser. No. 12/855,996, filed Aug. 13, 2010 (now U.S. Pat. No. 8,768,313), which is a non-provisional of provisional application 61/234,542, filed Aug. 17, 2009.

The present technology builds on, and extends, technology disclosed in other patent applications by the present assignee. The reader is thus directed to the following applications that serve to detail arrangements in which applicants intend the present technology to be applied, and that technically supplement the present disclosure:

Application Ser. No. 12/271,772, filed Nov. 14, 2008 (published as 20100119208);

Application Ser. No. 61/150,235, filed Feb. 5, 2009;

Application Ser. No. 61/157,153, filed Mar. 3, 2009;

Application Ser. No. 61/167,828, filed Apr. 8, 2009;

Application Ser. No. 12/468,402, filed May 19, 2009 (now U.S. Pat. No. 8,004,576);

Application Ser. No. 12/484,115, filed Jun. 12, 2009 (now U.S. Pat. No. 8,385,971); and

Application Ser. No. 12/498,709, filed Jul. 7, 2009 (published as 20100261465).

The disclosures of all the above-identified documents are incorporated herein by reference.

INTRODUCTION

The present specification details a diversity of technologies, assembled over an extended period of time, to serve a variety of different objectives. Yet they relate together in various ways, and can be used in conjunction, and so are presented collectively in this single document.

This varied, interrelated subject matter does not lend itself to a straightforward presentation. Thus, the reader's indulgence is solicited as this narrative occasionally proceeds in nonlinear fashion among the assorted topics and technologies.

BACKGROUND

Digimarc's U.S. Pat. No. 6,947,571 shows a system in which a cell phone camera captures content (e.g., image data), and processes same to derive information related to the imagery. This derived information is submitted to a data structure (e.g., a remote database), which indicates corresponding data or actions. The cell phone then displays responsive information, or takes responsive action. Such sequence of operations is sometimes referred to as “visual search.”

Related technologies are shown in patent publications 20080300011 (Digimarc), U.S. Pat. No. 7,283,983 and WO07/130688 (Evolution Robotics), 20070175998 and 20020102966 (DSPV), 20060012677, 20060240862 and 20050185060 (Google), 20060056707 and 20050227674 (Nokia), 20060026140 (ExBiblio), U.S. Pat. No. 6,491,217, 20020152388, 20020178410 and 20050144455 (Philips), 20020072982 and 20040199387 (Shazam), 20030083098 (Canon), 20010055391 (Qualcomm), 20010001854 (AirClic), U.S. Pat. No. 7,251,475 (Sony), U.S. Pat. No. 7,174,293 (Iceberg), U.S. Pat. No. 7,065,559 (Organnon Wireless), U.S. Pat. No. 7,016,532 (Evryx Technologies), U.S. Pat. Nos. 6,993,573 and 6,199,048 (Neomedia), U.S. Pat. No. 6,941,275 (Tune Hunter), U.S. Pat. No. 6,788,293 (Silverbrook Research), U.S. Pat. Nos. 6,766,363 and 6,675,165 (BarPoint), U.S. Pat. No. 6,389,055 (Alcatel-Lucent), U.S. Pat. No. 6,121,530 (Sonoda), and U.S. Pat. No. 6,002,946 (Reber/Motorola).

The presently-detailed technology concerns improvements to such technologies, moving towards the goal of intuitive computing: devices that can see and/or hear, and infer the user's desire in that sensed context.

SELECTED FEATURES

As will be apparent, the present specification details a wealth of novel technologies. To give the reader an introductory overview, a few such arrangements are reviewed in the following paragraphs:

(A) A portable device that receives input from one or more physical sensors, employs processing by one or more local services, and also employs processing by one or more remote services, wherein software in the device includes one or more abstraction layers through which said sensors, local services, and remote services are interfaced to the device architecture, facilitating substitution.

(B) A portable device that receives input from one or more physical sensors, processes the input and packages the result into keyvector form, and transmits the keyvector form from the device. Also, such an arrangement in which the device receives a further-processed counterpart to the keyvector back from a remote resource to which the keyvector was transmitted. Also, such an arrangement in which the keyvector form is processed (on the portable device or a remote device) in accordance with one or more instructions that are implied in accord with context.

(C) A distributed processing architecture for responding to physical stimulus sensed by a cell phone (aka “smart phone”), the architecture employing a local process on the cell phone, and a remote process on a remote computer, the two processes being linked by a packet network and an inter-process communication construct, the architecture also including a protocol by which different processes may communicate, this protocol including a message passing paradigm with either a message queue, or a collision handling arrangement. Also, such an arrangement in which driver software for one or more physical sensor components provides sensor data in packet form and places the packet on an output queue, either uniquely associated with that sensor or common to plural components; wherein local processes operate on the packets and place resultant packets back on the queue, unless the packet is to be processed remotely, in which case it is directed to a remote process by a router arrangement.

(D) An arrangement in which a network associated with a particular physical venue is adapted to automatically discern whether a set of visitors to the venue have a social connection, by reference to traffic on the network. Also, such an arrangement that also includes discerning a demographic characteristic of the group. Also, such an arrangement in which the network facilitates ad hoc networking among visitors who are discerned to have a social connection.

(E) An arrangement wherein a network including computer resources at a public venue is dynamically reconfigured in accordance with a predictive model of behavior of users visiting said venue. Also, such an arrangement in which the network reconfiguration is based, in part, on context. Also, such an arrangement wherein the network reconfiguration includes caching certain content. Also, such an arrangement in which the reconfiguration includes rendering synthesized content and storing in one or more of the computer resources to make same more rapidly available. Also, such an arrangement that includes throttling back time-insensitive network traffic in anticipation of a temporal increase in traffic from the users.

(F) An arrangement in which advertising is associated with real world content, and a charge therefor is assessed based on surveys of exposure to said content, as indicated by sensors in users' cell phones. Also, such an arrangement in which the charge is set through use of an automated auction arrangement.

(G) An arrangement including two subjects in a public venue, wherein illumination on said subjects is changed differently, based on an attribute of a person proximate to the subjects.

(H) An arrangement in which content is presented to persons in a public venue, and there is a link between the presented content and auxiliary content, wherein the linked auxiliary content is changed in accordance with a demographic attribute of a person to whom the content is presented.

(I) An arrangement wherein a temporary electronic license to certain content is granted to a person in connection with the person's visit to a public venue.

(J) An arrangement for recognition processing of stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which a decision regarding whether a first task should be performed on the device hardware or on a remote processor is made in automated fashion based on consideration of at least one factor drawn from both of the following groups: (1) bandwidth costs, external service provider costs, power costs to the cell phone battery, intangible costs in consumer (dis-)satisfaction by delaying processing, available processing capacity of the remote processor(s), distance to the remote processor(s); and (2) routing constraints, geographical considerations other than distance to the remote processor(s), risk of pipeline stall, and the relation of the first task to other processing tasks; wherein in some circumstances the first task is performed on the device hardware, and in other circumstances the first task is performed on the remote processor. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(K) An arrangement for processing stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which the sequence in which a set of tasks should be performed is determined in automated fashion based on consideration of two or more different factors drawn from a set that includes at least: mobile device power considerations; response time needed; routing constraints; state of hardware resources within the mobile device; connectivity status; geographical considerations; risk of pipeline stall; information about the remote processor including its readiness, processing speed, cost, and attributes of importance to a user of the mobile device; and the relation of the task to other processing tasks; wherein in some circumstances the set of tasks is performed in a first sequence, and in other circumstances the set of tasks is performed in a second, different, sequence. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(L) An arrangement for processing stimuli captured by a sensor of a user's mobile device, in which some processing tasks can be performed on processing hardware in the device, and other processing tasks can be performed on a processor (or plural processors) remote from the device, and in which packets are employed to convey data between processing tasks, and the contents of the packets are determined in automated fashion based on consideration of two or more different factors drawn from a set that includes at least: mobile device power considerations; response time needed; routing constraints; state of hardware resources within the mobile device; connectivity status; geographical considerations; risk of pipeline stall; information about the remote processor including its readiness, processing speed, cost, and attributes of importance to a user of the mobile device; and the relation of the task to other processing tasks; wherein in some circumstances the packets may include data of a first form, and in other circumstances the packets may include data of a second form. Also, such an arrangement in which the decision is based on a score dependent on a combination of parameters related to at least some of the listed considerations.

(M) An arrangement wherein a venue provides data services to users through a network, and the network is arranged to deter use of electronic imaging by users while in the venue. Also, such an arrangement in which the deterrence is effected by restricting transmission of data from user devices to certain data processing providers external to the network.

(N) An arrangement in which a mobile communications device with image capture capability includes a pipelined processing chain for performing a first operation, and a control system that has a mode in which it tests image data by performing a second operation thereon, the second operation being computationally simpler than the first operation, and the control system applies image data to the pipelined processing chain only if the second operation produces an output of a first type.

(O) An arrangement in which a cell phone is equipped with a GPU to facilitate rendering of graphics for display on a cell phone screen, e.g., for gaming, and the GPU is also employed for machine vision purposes. Also, such an arrangement in which the machine vision purpose includes facial detection.

(P) An arrangement in which plural socially-affiliated mobile devices, maintained by different individuals, cooperate in performing a machine vision operation. Also, such an arrangement wherein a first of the devices performs an operation to extract facial features from an image, and a second of the devices performs template matching on the extracted facial features produced by the first device.

(Q) An arrangement in which a voice recognition operation is performed on audio from an incoming video or phone call to identify a caller. Also, such an arrangement in which the voice recognition operation is performed only if the incoming call is not identified by CallerID data. Also, such an arrangement in which the voice recognition operation includes reference to data corresponding to one or more earlier-stored voice messages.

(R) An arrangement in which speech from an incoming video or phone call is recognized, and text data corresponding thereto is generated as the call is in process. Also, such an arrangement in which the incoming call is associated with a particular geography, and such geography is taken into account in recognizing the speech. Also, such an arrangement in which the text data is used to query a data structure for auxiliary information.

(S) An arrangement for populating overlay baubles onto a mobile device screen, derived from both local and cloud processing. Also, such an arrangement in which the overlay baubles are tuned in accordance with user preference information.

(T) An arrangement wherein a user may (1) be charged by a vendor for a data processing service, or alternatively (2) may be provided the service free or even receive credit from the vendor if the user takes certain action in relation thereto.

(U) An arrangement in which a user receives a commercial benefit in exchange for being presented with promotional content, as sensed by a mobile device conveyed by the user.

(V) An arrangement in which a first user allows a second party to expend credits of the first user, or incur expenses to be borne by the first user, by reason of a social networking connection between the first user and the second party. Also, such an arrangement in which a social networking web page is a construct with which the second party must interact in expending such credits, or incurring such expenses.

(W) An arrangement for charitable fundraising, in which a user interacts with a physical object associated with a charitable cause, to trigger a computer-related process facilitating a user donation to a charity.

(X) An arrangement wherein visual query data is processed in distributed fashion between a user's mobile device and cloud resources, to generate a response, and wherein related information is archived in the cloud and processed so that subsequent visual query data can generate a more intuitive response.

The foregoing and many other features and advantages of the present technology will be further apparent from the following detailed description, which proceeds with reference to the accompanying drawings.
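
By way of illustration only, the score-based decision mentioned in arrangement (J) might be sketched as follows. The function names, factor names, weights, normalized cost values, and the weighted-sum form of the score are all assumptions made for this sketch; the specification states only that the decision may be based on a score dependent on a combination of parameters.

```python
from typing import Dict


def placement_score(costs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of normalized costs; a factor missing from `costs` counts as 0."""
    return sum(w * costs.get(name, 0.0) for name, w in weights.items())


def choose_placement(device_costs: Dict[str, float],
                     remote_costs: Dict[str, float],
                     weights: Dict[str, float]) -> str:
    """Return "device" or "remote", whichever scores lower (ties go to the device)."""
    device = placement_score(device_costs, weights)
    remote = placement_score(remote_costs, weights)
    return "device" if device <= remote else "remote"


# Hypothetical weights: three factors from group (1) and one from group (2).
WEIGHTS = {"bandwidth": 1.0, "battery": 2.0, "delay": 1.5, "stall_risk": 1.0}

# Hypothetical normalized costs (0 = free, 1 = worst case) for one task.
DEVICE_COSTS = {"battery": 0.8, "delay": 0.2}
REMOTE_COSTS = {"bandwidth": 0.5, "delay": 0.6, "stall_risk": 0.3}

decision = choose_placement(DEVICE_COSTS, REMOTE_COSTS, WEIGHTS)
```

With these made-up values the battery cost outweighs the bandwidth and delay costs, so the task is referred to the remote processor. Different weights or costs can reverse the outcome, consistent with the task being performed on the device in some circumstances and remotely in others.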

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a high level view of an embodiment incorporating aspects of the present technology.

FIG. 2 shows some of the applications that a user may request a camera-equipped cell phone to perform.

FIG. 3 identifies some of the commercial entities in an embodiment incorporating aspects of the present technology.

FIGS. 4, 4A and 4B conceptually illustrate how pixel data, and derivatives, is applied in different tasks, and packaged into packet form.

FIG. 5 shows how different tasks may have certain image processing operations in common.

FIG. 6 is a diagram illustrating how common image processing operations can be identified, and used to configure cell phone processing hardware to perform these operations.

FIG. 7 is a diagram showing how a cell phone can send certain pixel-related data across an internal bus for local processing, and send other pixel-related data across a communications channel for processing in the cloud.

FIG. 8 shows how the cloud processing in FIG. 7 allows tremendously more “intelligence” to be applied to a task desired by a user.

FIG. 9 details how keyvector data is distributed to different external service providers, who perform services in exchange for compensation, which is handled in consolidated fashion for the user.

FIG. 10 shows an embodiment incorporating aspects of the present technology, noting how cell phone-based processing is suited for simple object identification tasks, such as template matching, whereas cloud-based processing is suited for complex tasks, such as data association.

FIG. 10A shows an embodiment incorporating aspects of the present technology, noting that the user experience is optimized by performing visual keyvector processing as close to a sensor as possible, and administering traffic to the cloud as low in a communications stack as possible.

FIG. 11 illustrates that tasks referred for external processing can be routed to a first group of service providers who routinely perform certain tasks for the cell phone, or can be routed to a second group of service providers who compete on a dynamic basis for processing tasks from the cell phone.

FIG. 12 further expands on concepts of FIG. 11, e.g., showing how a bid filter and broadcast agent software module may oversee a reverse auction process.

FIG. 13 is a high level block diagram of a processing arrangement incorporating aspects of the present technology.

FIG. 14 is a high level block diagram of another processing arrangement incorporating aspects of the present technology.

FIG. 15 shows an illustrative range of image types that may be captured by a cell phone camera.

FIG. 16 shows a particular hardware implementation incorporating aspects of the present technology.

FIG. 17 illustrates aspects of a packet used in an exemplary embodiment.

FIG. 18 is a block diagram illustrating an implementation of the SIFT technique.

FIG. 19 is a block diagram illustrating, e.g., how packet header data can be changed during processing, through use of a memory.

FIG. 19A shows a prior art architecture from the robotic Player Project.

FIG. 19B shows how various factors can influence how different operations may be handled.

FIG. 20 shows an arrangement by which a cell phone camera and a cell phone projector share a lens.

FIG. 20A shows a reference platform architecture that can be used in embodiments of the present technology.

FIG. 21 shows an image of a desktop telephone captured by a cell phone camera.

FIG. 22 shows a collection of similar images found in a repository of public images, by reference to characteristics discerned from the image of FIG. 21.

FIGS. 23-28A, and 30-34 are flow diagrams detailing methods incorporating aspects of the present technology.

FIG. 29 is an arty shot of the Eiffel Tower, captured by a cell phone user.

FIG. 35 is another image captured by a cell phone user.

FIG. 36 is an image of an underside of a telephone, discovered using methods according to aspects of the present technology.

FIG. 37 shows part of the physical user interface of one style of cell phone.

FIGS. 37A and 37B illustrate different linking topologies.

FIG. 38 is an image captured by a cell phone user, depicting an Appalachian Trail trail marker.

FIGS. 39-43 detail methods incorporating aspects of the present technology.

FIG. 44 shows the user interface of one style of cell phone.

FIGS. 45A and 45B illustrate how different dimensions of commonality may be explored through use of a user interface control of a cell phone.

FIGS. 46A and 46B detail a particular method incorporating aspects of the present technology, by which keywords such as Prometheus and Paul Manship are automatically determined from a cell phone image.

FIG. 47 shows some of the different data sources that may be consulted in processing imagery according to aspects of the present technology.

FIGS. 48A, 48B and 49 show different processing methods according to aspects of the present technology.

FIG. 50 identifies some of the different processing that may be performed on image data, in accordance with aspects of the present technology.

FIG. 51 shows an illustrative tree structure that can be employed in accordance with certain aspects of the present technology.

FIG. 52 shows a network of wearable computers (e.g., cell phones) that can cooperate with each other, e.g., in a peer-to-peer network.

FIGS. 53-55 detail how a glossary of signs can be identified by a cell phone, and used to trigger different actions.

FIG. 56 illustrates aspects of prior art digital camera technology.

FIG. 57 details an embodiment incorporating aspects of the present technology.

FIG. 58 shows how a cell phone can be used to sense and display affine parameters.

FIG. 59 illustrates certain state machine aspects of the present technology.

FIG. 60 illustrates how even “still” imagery can include temporal, or motion, aspects.

FIG. 61 shows some metadata that may be involved in an implementation incorporating aspects of the present technology.

FIG. 62 shows an image that may be captured by a cell phone camera user.

FIGS. 63-66 detail how the image of FIG. 62 can be processed to convey semantic metadata.

FIG. 67 shows another image that may be captured by a cell phone camera user.

FIGS. 68 and 69 detail how the image of FIG. 67 can be processed to convey semantic metadata.

FIG. 70 shows an image that may be captured by a cell phone camera user.

FIG. 71 details how the image of FIG. 70 can be processed to convey semantic metadata.

FIG. 72 is a chart showing aspects of the human visual system.

FIG. 73 shows different low, mid and high frequency components of an image.

FIG. 74 shows a newspaper page.

FIG. 75 shows the layout of the FIG. 74 page, as set by layout software.

FIG. 76 details how user interaction with imagery captured from printed text may be enhanced.

FIGS. 77A and 77B illustrate how semantic conveyance of metadata can have a progressive aspect, akin to JPEG2000 and the like.

DETAILED DESCRIPTION

The present specification details a collection of interrelated work, including a great variety of technologies. A few of these include image processing architectures for cell phones, cloud-based computing, reverse auction-based delivery of services, metadata processing, image conveyance of semantic information, etc., etc., etc. Each portion of the specification details technology that desirably incorporates technical features detailed in other portions. Thus, it is difficult to identify “a beginning” from which this disclosure should logically begin. That said, we simply dive in.

Mobile Device Object Recognition and Interaction Using Distributed Network Services

There is presently a huge disconnect between the unfathomable volume of information that is contained in high quality image data streaming from a mobile device camera (e.g., in a cell phone), and the ability of that mobile device to process this data to whatever end. “Off device” processing of visual data can help handle this fire hose of data, especially when a multitude of visual processing tasks may be desired. These issues become even more critical once “real time object recognition and interaction” is contemplated, where a user of the mobile device expects virtually instantaneous results and augmented reality graphic feedback on the mobile device screen, as that user points the camera at a scene or object.

In accordance with one aspect of the present technology, a distributed network of pixel processing engines serves such mobile device users and meets most qualitative “human real time interactivity” requirements, generally with feedback in much less than one second. Implementation desirably provides certain basic features on the mobile device, including a rather intimate relationship between the image sensor's output pixels and the native communications channel available to the mobile device. Certain levels of basic “content filtering and classification” of the pixel data on the local device, followed by attaching routing instructions to the pixel data as specified by the user's intentions and subscriptions, lead to an interactive session between a mobile device and one or more “cloud based” pixel processing services. The key word “session” further indicates fast responses transmitted back to the mobile device, where for some services marketed as “real time” or “interactive,” a session essentially represents a duplex, generally packet-based, communication, where several outgoing “pixel packets” and several incoming response packets (which may be pixel packets updated with the processed data) may occur every second.

Business factors and good old competition are at the heart of the distributed network. Users can subscribe to or otherwise tap into any external services they choose. The local device itself and/or the carrier service provider to that device can be configured as the user chooses, routing filtered and pertinent pixel data to specified object interaction services. Billing mechanisms for such services can directly plug into existing cell and/or mobile device billing networks, wherein users get billed and service providers get paid.

But let's back up a bit. The addition of camera systems to mobile devices has ignited an explosion of applications. The primordial application certainly must be folks simply snapping quick visual aspects of their environment and sharing such pictures with friends and family.

The fanning out of applications from that starting point arguably hinges on a set of core plumbing features inherent in mobile cameras. In short (and non-exhaustive of course), such features include: a) higher quality pixel capture and low level processing; b) better local device CPU and GPU resources for on-device pixel processing with subsequent user feedback; c) structured connectivity into “the cloud;” and importantly, d) a maturing traffic monitoring and billing infrastructure. FIG. 1 is but one graphic perspective on some of these plumbing features of what might be called a visually intelligent network. (Conventional details of a cell phone, such as the microphone, A/D converter, modulation and demodulation systems, IF stages, cellular transceiver, etc., are not shown for clarity of illustration.)

It is all well and good to get better CPUs and GPUs, and more memory, on mobile devices. However, cost, weight and power considerations seem to favor getting “the cloud” to do as much of the “intelligence” heavy lifting as possible.

Relatedly, it seems that there should be a common denominator set of “device-side” operations performed on visual data that will serve all cloud processes, including certain formatting, elemental graphic processing, and other rote operations. Similarly, it seems there should be a standardized basic header and addressing scheme for the resulting communication traffic (typically packetized) back and forth with the cloud.

This conceptualization is akin to the human visual system. The eye performs baseline operations, such as chromaticity groupings, and it optimizes necessary information for transmission along the optic nerve to the brain. The brain does the real cognitive work. And there's feedback the other way too, with the brain sending information controlling muscle movement: where to point eyes, scanning lines of a book, controlling the iris (lighting), etc.

FIG. 2 depicts a non-exhaustive but illustrative list of visual processing applications for mobile devices. Again, it is hard not to see analogies between this list and the fundamentals of how the human visual system and the human brain operate. It is a well studied academic area that deals with how “optimized” the human visual system is relative to any given object recognition task, where a general consensus is that the eye-retina-optic nerve-cortex system is pretty darn wonderful in how efficiently it serves a vast array of cognitive demands. This aspect of the technology relates to how similarly efficient and broadly enabling elements can be built into mobile devices, mobile device connections and network services, all with the goal of serving the applications depicted in FIG. 2, as well as those new applications which may show up as the technology dance continues.

Perhaps the central difference between the human analogy and mobile device networks revolves around the basic concept of “the marketplace,” where buyers buy better and better things so long as businesses know how to profit accordingly. Any technology which aims to serve the applications listed in FIG. 2 must necessarily assume that hundreds if not thousands of business entities will be developing the nitty gritty details of specific commercial offerings, with the expectation of one way or another profiting from those offerings. Yes, a few behemoths will dominate main lines of cash flows in the overall mobile industry, but an equal certainty will be that niche players will be continually developing niche applications and services. Thus, this disclosure describes how a marketplace for visual processing services can develop, whereby business interests across the spectrum have something to gain. FIG. 3 attempts a crude categorization of some of the business interests applicable to the global business ecosystem operative in the era of this filing.

FIG. 4 sprints toward the abstract in the introduction of the technology aspect now being considered. Here we find a highly abstracted bit of information derived from some batch of photons that impinged on some form of electronic image sensor, with a universe of waiting consumers of that lowly bit. FIG. 4A then quickly introduces the intuitively well-known concept that singular bits of visual information aren't worth much outside of their role in both spatial and temporal groupings. This core concept is well exploited in modern video compression standards such as MPEG7 and H.264.

The “visual” character of the bits may be pretty far removed from the visual domain by certain of the processing (consider, e.g., the vector strings representing eigenface data). Thus, we sometimes use the term “keyvector data” (or “keyvector strings”) to refer collectively to raw sensor/stimulus data (e.g., pixel data), and/or to processed information and associated derivatives. A keyvector may take the form of a container in which such information is conveyed (e.g., a data structure such as a packet). A tag or other data can be included to identify the type of information (e.g., JPEG image data, eigenface data), or the data type may be otherwise evident from the data or from context. One or more instructions, or operations, may be associated with keyvector data, either expressly detailed in the keyvector, or implied. An operation may be implied in default fashion, for keyvector data of certain types (e.g., for JPEG data it may be “store the image;” for eigenface data it may be “match this eigenface template”). Or an implied operation may be dependent on context.
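
The keyvector container just described can be pictured with a short sketch. The class name, field names, and tag strings below are hypothetical, as is the precedence order (express instruction first, then context, then the type default); the text does not specify how an implied operation is resolved when more than one source applies.

```python
from dataclasses import dataclass
from typing import Optional

# Default operations implied by data type, taken from the two examples in the text.
# The tag strings "jpeg" and "eigenface" are assumed names for this sketch.
DEFAULT_OPERATIONS = {
    "jpeg": "store the image",
    "eigenface": "match this eigenface template",
}


@dataclass
class KeyVector:
    payload: bytes                    # raw sensor data or a processed derivative
    data_type: Optional[str] = None   # optional tag identifying the information type
    operation: Optional[str] = None   # instruction expressly detailed in the keyvector


def resolve_operation(kv: KeyVector, context_operation: Optional[str] = None) -> Optional[str]:
    """Pick the operation for a keyvector.

    Assumed precedence: express instruction, then context, then type default.
    Returns None when nothing is expressed or implied.
    """
    if kv.operation is not None:
        return kv.operation
    if context_operation is not None:
        return context_operation
    if kv.data_type is None:
        return None
    return DEFAULT_OPERATIONS.get(kv.data_type)


photo = KeyVector(payload=b"\xff\xd8", data_type="jpeg")
face = KeyVector(payload=b"\x00\x01", data_type="eigenface", operation="enroll this template")
```

Here `resolve_operation(photo)` falls back to the type default, while `resolve_operation(face)` returns the express instruction carried in the keyvector.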

FIGS. 4A and 4B also introduce a central player in this disclosure: the packaged and address-labeled pixel packet, into a body of which keyvector data is inserted. The keyvector data may be a single patch, or a collection of patches, or a time-series of patches/collections. A pixel packet may be less than a kilobyte, or its size can be much, much larger. It may convey information about an isolated patch of pixels excerpted from a larger image, or it may convey a massive Photosynth of Notre Dame cathedral.

(As presently conceived, a pixel packet is an application layer construct. When actually pushed around a network, however, it may be broken into smaller portions—as transport layer constraints in a network may require.)
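The breaking of an application layer pixel packet into smaller portions can be sketched as follows. This is an illustrative assumption only; the function names and the (sequence number, total count, chunk) fragment layout are hypothetical and are not drawn from any particular transport protocol.

```python
def fragment(packet: bytes, max_size: int):
    """Split a pixel packet into (seq, total, chunk) portions."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks = [packet[i:i + max_size]
              for i in range(0, len(packet), max_size)] or [b""]
    total = len(chunks)
    return [(seq, total, chunk) for seq, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Rebuild the packet from portions that may arrive out of order."""
    ordered = sorted(fragments, key=lambda f: f[0])
    total = ordered[0][1]
    if [f[0] for f in ordered] != list(range(total)):
        raise ValueError("missing or duplicated fragments")
    return b"".join(chunk for _, _, chunk in ordered)
```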

FIG. 5 is a segue diagram—still at an abstract level, but pointing toward the concrete. A list of user-defined applications, such as illustrated in FIG. 2, will map to a state-of-the-art inventory of pixel processing methods and approaches which can accomplish each and every application. These pixel processing methods break down into common and not-so-common component sub-tasks. Object recognition textbooks are filled with a wide variety of approaches and terminologies which bring a sense of order into what at first glance might appear to be a bewildering array of “unique requirements” relative to the applications shown in FIG. 2. (In addition, multiple computer vision and image processing libraries, such as OpenCV and CMVision, discussed below, have been created that identify and render functional operations, which can be considered “atomic” functions within object recognition paradigms.) But FIG. 5 attempts to show that there is indeed a set of common steps and processes shared between visual processing applications. The differently shaded pie slices attempt to illustrate that certain pixel operations are of a specific class and may simply have differences in low level variables or optimizations. The size of the overall pie (thought of in a logarithmic sense, where a pie twice the size of another may represent 10 times more Flops, for example), and the percentage size of the slice, represent degrees of commonality.

FIG. 6 takes a major step toward the concrete, sacrificing simplicity in the process. Here we see a top portion labeled “Resident Call-Up Visual Processing Services,” which represents all of the possible list of applications from FIG. 2 that a given mobile device may be aware of, or downright enabled to perform. The idea is that not all of these applications have to be active all of the time, and hence some sub-set of services is actually “turned on” at any given moment. The turned on applications, as a one-time configuration activity, negotiate to identify their common component tasks, labeled the “Common Processes Sorter”—first generating an overall common list of pixel processing routines available for on-device processing, chosen from a library of these elemental image processing routines (e.g., FFT, filtering, edge detection, resampling, color histogramming, log-polar transform, etc.). Generation of corresponding Flow Gate Configuration/Software Programming information follows, which literally loads library elements into properly ordered places in a field programmable gate array set-up, or otherwise configures a suitable processor to perform the required component tasks.
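One way to picture the sorting of common component tasks is the short Python sketch below. It is an assumption offered for illustration, not the disclosed implementation; the function name, the application names, and the choice to rank routines by how many active applications share them are all hypothetical.

```python
from collections import Counter


def common_processes(active_apps, device_library):
    """List on-device routines needed by the active applications.

    active_apps maps an application name to the routines it requires;
    device_library is the set of elemental routines available on-device.
    Routines are ordered by how many applications share them.
    """
    counts = Counter(routine
                     for routines in active_apps.values()
                     for routine in set(routines))
    shared = [(r, n) for r, n in counts.items() if r in device_library]
    return sorted(shared, key=lambda item: (-item[1], item[0]))
```

For example, with three active applications that all require filtering, that routine heads the list, and a routine absent from the device library is omitted (and would be referred elsewhere for processing).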

FIG. 6 also includes depictions of the image sensor, followed by a universal pixel segmenter. This pixel segmenter breaks down the massive stream of imagery from the sensor into manageable spatial and/or temporal blobs (e.g., akin to MPEG macroblocks, wavelet transform blocks, 64×64 pixel blocks, etc.). After the torrent of pixels has been broken down into chewable chunks, they are fed into the newly programmed gate array (or other hardware), which performs the elemental image processing tasks associated with the selected applications. (Such arrangements are further detailed below, in an exemplary system employing “pixel packets.”) Various output products are sent to a routing engine, which refers the elementally-processed data (e.g., keyvector data) to other resources (internal and/or external) for further processing. This further processing typically is more complex than that already performed. Examples include making associations, deriving inferences, pattern and template matching, etc. This further processing can be highly application-specific.
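The spatial part of such segmentation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the image is represented as a list of pixel rows, the function name `segment` is hypothetical, and the 64-pixel default merely echoes the block size mentioned above. Temporal grouping is not shown.

```python
def segment(image, block=64):
    """Yield (top, left, tile) for each block-sized tile of the image.

    image is a list of rows of pixel values. Tiles at the right and
    bottom edges may be smaller than block x block.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [row[left:left + block]
                    for row in image[top:top + block]]
            yield top, left, tile
```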

(Consider a promotional game from Pepsi, inviting the public to participate in a treasure hunt in a state park. Based on internet-distributed clues, people try to find a hidden six-pack of soda to earn a $500 prize. Participants must download a special application from the Pepsi-dot-com web site (or the Apple AppStore), which serves to distribute the clues (which may also be published to Twitter). The downloaded application also has a prize verification component, which processes image data captured by the users' cell phones to identify a special pattern with which the hidden six-pack is uniquely marked. SIFT object recognition is used (discussed below), with the SIFT feature descriptors for the special package conveyed with the downloaded application. When an image match is found, the cell phone immediately reports same wirelessly to Pepsi. The winner is the user whose cell phone first reports detection of the specially-marked six-pack. In the FIG. 6 arrangement, some of the component tasks in the SIFT pattern matching operation are performed by the elemental image processing in the configured hardware; others are referred for more specialized processing—either internal or external.)

FIG. 7 up-levels the picture to a generic distributed pixel services network view, where local device pixel services and “cloud based” pixel services have a kind of symmetry in how they operate. The router in FIG. 7 takes care of how any given packaged pixel packet gets sent to the appropriate pixel processing location, whether local or remote (with the style of fill pattern denoting different component processing functions; only a few of the processing functions required by the enabled visual processing services are depicted). Some of the data shipped to cloud-based pixel services may have been first processed by local device pixel services. The circles indicate that the routing functionality may have components in the cloud—nodes that serve to distribute tasks to active service providers, and collect results for transmission back to the device. In some implementations these functions may be performed at the edge of the wireless network, e.g., by modules at wireless service towers, so as to ensure the fastest action. Results collected from the active external service providers, and the active local processing stages, are fed back to Pixel Service Manager software, which then interacts with the device user interface.

FIG. 8 is an expanded view of the lower right portion of FIG. 7 and represents the moment where Dorothy's shoes turn red and why distributed pixel services provided by the cloud—as opposed to the local device—will probably trump all but the most mundane object recognition tasks.

Object recognition in its richer form is based on visual association rather than strict template matching rules. If we all were taught that the capital letter “A” will always be strictly following some pre-historic form never to change, a universal template image if you will, then pretty clean and locally prescriptive methods can be placed into a mobile imaging device in order to get it to reliably read a capital A any time that ordained form “A” is presented to the camera. 2D and even 3D barcodes in many ways follow this template-like approach to object recognition, where for contained applications involving such objects, local processing services can largely get the job done. But even in the barcode example, flexibility in the growth and evolution of overt visual coding targets begs for an architecture which doesn't force “code upgrades” to a gazillion devices every time there is some advance in the overt symbology art.

At the other end of the spectrum, arbitrarily complex tasks can be imagined, e.g., referring to a network of supercomputers the task of predicting the apocryphal typhoon resulting from the fluttering of a butterfly's wings halfway around the world—if the application requires it. Oz beckons.

FIG. 8 attempts to illustrate this radical extra dimensionality of pixel processing in the cloud as opposed to the local device. This virtually goes without saying (or without a picture), but FIG. 8 is also a segue figure to FIG. 9, where Dorothy gets back to Kansas and is happy about it.

FIG. 9 is all about cash, cash flow, and happy humans using cameras on their mobile devices and getting highly meaningful results back from their visual queries, all the while paying one monthly bill. It turns out the Google “AdWords” auction genie is out of the bottle. Behind the scenes of the moment-by-moment visual scans from a mobile user of their immediate visual environment are hundreds and thousands of micro-decisions, pixel routings, results comparisons and micro-auctioned channels back to the mobile device user for the hard good they are “truly” looking for, whether they know it or not. This last point is deliberately cheeky, in that searching of any kind is inherently open ended and magical at some level, and part of the fun of searching in the first place is that surprisingly new associations are part of the results. The search user knows after the fact what they were truly looking for. The system, represented in FIG. 9 as the carrier-based financial tracking server, now sees the addition of our networked pixel services module and its role in facilitating pertinent results being sent back to a user, all the while monitoring the uses of the services in order to populate the monthly bill and send the proceeds to the proper entities.

(As detailed further elsewhere, the money flow may not exclusively be to remote service providers. Other money flows can arise, such as to users or other parties, e.g., to induce or reward certain actions.)

FIG. 10 focuses on functional division of processing—illustrating how tasks in the nature of template matching can be performed on the cell phone itself, whereas more sophisticated tasks (in the nature of data association) desirably are referred to the cloud for processing.

Elements of the foregoing are distilled in FIG. 10A, showing an implementation of aspects of the technology as a physical matter of (usually) software components. The two ovals in the figure highlight the symmetric pair of software components which are involved in setting up a “human real-time” visual recognition session between a mobile device and the generic cloud or service providers, data associations and visual query results. The oval on the left refers to “keyvectors” and more specifically “visual keyvectors.” As noted, this term can encompass everything from simple JPEG compressed blocks all the way through log-polar transformed facial feature vectors and anything in between and beyond. The point of a keyvector is that the essential raw information of some given visual recognition task has been optimally pre-processed and packaged (possibly compressed). The oval on the left assembles these packets, and typically inserts some addressing information by which they will be routed. (Final addressing may not be possible, as the packet may ultimately be routed to remote service providers—the details of which may not yet be known.) Desirably, this processing is performed as close to the raw sensor data as possible, such as by processing circuitry integrated on the same substrate as the image sensor, which is responsive to software instructions stored in memory or provided from another stage in packet form.

The oval on the right administers the remote processing of keyvector data, e.g., attending to arranging appropriate services, directing traffic flow, etc. Desirably, this software process is implemented as low down on a communications stack as possible, generally on a “cloud side” device, access point, or cell tower. (When real-time visual keyvector packets stream over a communications channel, the lower down in the communications stack they are identified and routed, the smoother the “human real-time” look and feel a given visual recognition task will be.) Remaining high level processing needed to support this arrangement is included in FIG. 10A for context, and can generally be performed through native mobile and remote hardware capabilities.

FIGS. 11 and 12 illustrate the concept that some providers of some cloud-based pixel processing services may be established in advance, in a pseudo-static fashion, whereas other providers may periodically vie for the privilege of processing a user's keyvector data, through participation in a reverse auction. In many implementations, these latter providers compete each time a packet is available for processing.

Consider a user who snaps a cell phone picture of an unfamiliar car, wanting to learn the make and model. Various service providers may compete for this business. A startup vendor may offer to perform recognition for free—to build its brand or collect data. Imagery submitted to this service returns information simply indicating the car's make and model. Consumer Reports may offer an alternative service—which provides make and model data, but also provides technical specifications for the car. However, they may charge 2 cents for the service (or the cost may be bandwidth based, e.g., 1 cent per megapixel). Edmunds, or J.D. Power, may offer still another service, which provides data like Consumer Reports, but pays the user for the privilege of providing data. In exchange, the vendor is given the right to have one of its partners send a text message to the user promoting goods or services. The payment may take the form of a credit on the user's monthly cell phone voice/data service billing.

Using criteria specified by the user, stored preferences, context, and other rules/heuristics, a query router and response manager (in the cell phone, in the cloud, distributed, etc.) determines whether the packet of data needing processing should be handled by one of the service providers in the stable of static standbys, or whether it should be offered to providers on an auction basis—in which case it arbitrates the outcome of the auction.

The static standby service providers may be identified when the phone is initially programmed, and only reconfigured when the phone is reprogrammed. (For example, Verizon may specify that all FFT operations on its phones be routed to a server that it provides for this purpose.) Or, the user may be able to periodically identify preferred providers for certain tasks, as through a configuration menu, or specify that certain tasks should be referred for auction. Some applications may emerge where static service providers are favored; the task may be so mundane, or one provider's services may be so unparalleled, that competition for the provision of services isn't warranted.

In the case of services referred to auction, some users may exalt price above all other considerations. Others may insist on domestic data processing. Others may want to stick to service providers that meet “green,” “ethical,” or other standards of corporate practice. Others may prefer richer data output. Weightings of different criteria can be applied by the query router and response manager in making the decision.
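The weighting of criteria can be illustrated by a simple weighted-sum selection, sketched below. This is only one possible scheme and is an assumption; the attribute names (e.g., `low_price`, `domestic`, `richness`), the 0-to-1 attribute scale, and the `required` filter for hard constraints are all hypothetical.

```python
def score(offer, weights):
    """Weighted sum of an offer's attributes (each assumed in 0..1)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in offer["attributes"].items())


def select_provider(offers, weights, required=()):
    """Pick the best-scoring offer that satisfies every hard requirement."""
    eligible = [o for o in offers
                if all(o["attributes"].get(r, 0) > 0 for r in required)]
    if not eligible:
        return None
    return max(eligible, key=lambda o: score(o, weights))["name"]
```

A user who exalts price would weight `low_price` heavily; a user who insists on domestic processing would list `domestic` as a requirement rather than a weight.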

In some circumstances, one input to the query router and response manager may be the user's location, so that a different service provider may be selected when the user is at home in Oregon, than when she is vacationing in Mexico. In other instances, the required turnaround time is specified, which may disqualify some vendors, and make others more competitive. In some instances the query router and response manager need not decide at all, e.g., if cached results identifying a service provider selected in a previous auction are still available and not beyond a “freshness” threshold.
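The reuse of a cached auction outcome, subject to a “freshness” threshold, can be sketched as follows. The class name `ProviderCache`, the per-task keying, and the use of seconds for the threshold are assumptions made for illustration.

```python
import time


class ProviderCache:
    """Remember which provider won a previous auction for each task."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._entries = {}  # task -> (provider, time stored)

    def store(self, task, provider, now=None):
        stamp = time.time() if now is None else now
        self._entries[task] = (provider, stamp)

    def lookup(self, task, now=None):
        """Return the cached provider, or None if absent or stale."""
        now = time.time() if now is None else now
        entry = self._entries.get(task)
        if entry is None:
            return None
        provider, stored_at = entry
        if now - stored_at > self.max_age:
            del self._entries[task]  # beyond the freshness threshold
            return None
        return provider
```

When `lookup` returns None, the query router and response manager would fall back to a static provider or a new auction.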

Pricing offered by the vendors may change with processing load, bandwidth, time of day, and other considerations. In some embodiments the providers may be informed of offers submitted by competitors (using known trust arrangements assuring data integrity), and given the opportunity to make their offers more enticing. Such a bidding war may continue until no bidder is willing to change the offered terms. The query router and response manager (or in some implementations, the user) then makes a selection.

For expository convenience and visual clarity, FIG. 12 shows a software module labeled “Bid Filter and Broadcast Agent.” In most implementations this forms part of the query router and response manager module. The bid filter module decides which vendors—from a universe of possible vendors—should be given a chance to bid on a processing task. (The user's preference data, or historical experience, may indicate that certain service providers be disqualified.) The broadcast agent module then communicates with the selected bidders to inform them of a user task for processing, and provides information needed for them to make a bid.

Desirably, the bid filter and broadcast agent do at least some of their work in advance of data being available for processing. That is, as soon as a prediction can be made as to an operation that the user may likely soon request, these modules start working to identify a provider to perform a service expected to be required. A few hundred milliseconds later the user keyvector data may actually be available for processing (if the prediction turns out to be accurate).

Sometimes, as with Google's present AdWords system, the service providers are not consulted at each user transaction. Instead, each provides bidding parameters, which are stored and consulted whenever a transaction is considered, to determine which service provider wins. These stored parameters may be updated occasionally. In some implementations the service provider pushes updated parameters to the bid filter and broadcast agent whenever available. (The bid filter and broadcast agent may serve a large population of users, such as all Verizon subscribers in area code 503, or all subscribers to an ISP in a community, or all users at the domain well-dot-com, etc.; or more localized agents may be employed, such as one for each cell phone tower.)

If there is a lull in traffic, a service provider may discount its services for the next minute. The service provider may thus transmit (or post) a message stating that it will perform eigenvector extraction on an image file of up to 10 megabytes for 2 cents until 1244754176 (Coordinated Universal Time, expressed in seconds since the Unix epoch), after which time the price will return to 3 cents. The bid filter and broadcast agent updates a table with stored bidding parameters accordingly.
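The table of stored bidding parameters, with a time-limited discount of the sort just described, might be sketched as below. The class and function names are hypothetical, as is the treatment of 10 megabytes as 10,000,000 bytes and the convention that the discount still applies at the stated expiry instant.

```python
from dataclasses import dataclass


@dataclass
class StoredBid:
    provider: str
    operation: str
    max_bytes: int        # largest input the offer covers
    regular_cents: int
    discount_cents: int
    discount_until: int   # Unix time (UTC seconds) when the discount ends

    def price_at(self, now, size_bytes):
        """Price in cents at time `now`, or None if the input is too large."""
        if size_bytes > self.max_bytes:
            return None
        if now <= self.discount_until:
            return self.discount_cents
        return self.regular_cents


def update_bid(table, bid):
    """Replace any earlier parameters from the same provider/operation."""
    table[(bid.provider, bid.operation)] = bid


def cheapest(table, operation, now, size_bytes):
    """Return (cents, provider) for the lowest applicable stored bid."""
    quotes = [(bid.price_at(now, size_bytes), bid.provider)
              for bid in table.values() if bid.operation == operation]
    quotes = [q for q in quotes if q[0] is not None]
    return min(quotes) if quotes else None
```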

(Information about the Google reverse auction, used to place sponsored advertising on web search results pages, is reproduced at the end of the provisional specification to which this application claims priority. This information was published in Wired magazine on May 22, 2009, in an article by Steven Levy, entitled “Secret of Googlenomics: Data-Fueled Recipe Brews Profitability.”)

In other implementations, the broadcast agent polls the bidders—communicating relevant parameters, and soliciting bid responses whenever a transaction is offered for processing.

Once a prevailing bidder is decided, and data is available for processing, the broadcast agent transmits the keyvector data (and other parameters as may be appropriate to a particular task) to the winning bidder. The bidder then performs the requested operation, and returns the processed data to the query router and response manager. This module logs the processed data, and attends to any necessary accounting (e.g., crediting the service provider with the appropriate fee). The response data is then forwarded back to the user device.

In a variant arrangement, one or more of the competing service providers actually performs some or all of the requested processing, but “teases” the user (or the query router and response manager) by presenting only partial results. With a taste of what's available, the user (or the query router and response manager) may be induced to make a different choice than relevant criteria/heuristics would otherwise indicate.

The function calls sent to external service providers, of course, do not have to provide the ultimate result sought by a consumer (e.g., identifying a car, or translating a menu listing from French to English). They can be component operations, such as calculating an FFT, or performing a SIFT procedure or a log-polar transform, or computing a histogram or eigenvectors, or identifying edges, etc.

In time, it is expected that a rich ecosystem of expert processors will emerge—serving myriad processing requests from cell phones and other thin client devices.

More on Monetary Flow

Additional business models can be enabled, involving the subsidization of consumed remote services by the service providers themselves in exchange for user information (e.g., for audience measurement), or in exchange for action taken by the user, such as completing a survey, visiting specific sites, locations in store, etc.

Services may be subsidized by third parties as well, such as a coffee shop that derives value by providing a differentiating service to its customers in the form of free/discounted usage of remote services while they are seated in the shop.

In one arrangement an economy is enabled wherein a currency of remote processing credits is created and exchanged between users and remote service providers. This may be entirely transparent to the user and managed as part of a service plan, e.g., with the user's cell phone or data service provider. Or it can be exposed as a very explicit aspect of the present technology. Service providers and others may award credits to users for taking actions or being part of a frequent-user program to build allegiance with specific providers.

As with other currencies, users may choose to explicitly donate, save, exchange or generally barter credits as needed.

Considering these points in further detail, a service may pay a user for opting-in to an audience measurement panel. E.g., The Nielsen Company may provide services to the public—such as identification of television programming from audio or video samples submitted by consumers. These services may be provided free to consumers who agree to share some of their media consumption data with Nielsen (such as by serving as an anonymous member for a city's audience measurement panel), and provided on a fee basis to others. Nielsen may offer, for example, 100 units of credit—micropayments or other value—to participating consumers each month, or may provide credit each time the user submits information to Nielsen.

In another example, a consumer may be rewarded for accepting commercials, or commercial impressions, from a company. If a consumer goes into the Pepsi Center in Denver, she may receive a reward for each Pepsi-branded experience she encounters. The amount of micropayment may scale with the amount of time that she interacts with the different Pepsi-branded objects (including audio and imagery) in the venue.

Not just large brand owners can provide credits to individuals. Credits can be routed to friends and social/business acquaintances. To illustrate, a user of Facebook may share credit (redeemable for goods/services, or exchangeable for cash) from his Facebook page—enticing others to visit, or linger. In some cases, the credit can be made available only to people who navigate to the Facebook page in a certain manner—such as by linking to the page from the user's business card, or from another launch page.

As another example, consider a Facebook user who has earned, or paid for, or otherwise received credit that can be applied to certain services—such as for downloading songs from iTunes, or for music recognition services, or for identifying clothes that go with particular shoes (for which an image has been submitted), etc. These services may be associated with the particular Facebook page, so that friends can invoke the services from that page—essentially spending the host's credit (again, with suitable authorization or invitation by that hosting user). Likewise, friends may submit images to a facial recognition service accessible through an application associated with the user's Facebook page. Images submitted in such fashion are analyzed for faces of the host's friends, and identification information is returned to the submitter, e.g., through a user interface presented on the originating Facebook page. Again, the host may be assessed a fee for each such operation, but may allow authorized friends to avail themselves of such service at no cost.

Credits, and payments, can also be routed to charities. A viewer exiting a theatre after a particularly poignant movie about poverty in Bangladesh may capture an image of an associated movie poster, which serves as a portal for donations for a charity that serves the poor in Bangladesh. Upon recognizing the movie poster, the cell phone can present a graphical/touch user interface through which the user spins dials to specify an amount of a charitable donation, which at the conclusion of the transaction is transferred from a financial account associated with the user, to one associated with the charity.

More on a Particular Hardware Arrangement

As noted above and in the cited patent documents, there is a need for generic object recognition by a mobile device. Some approaches to specialized object recognition have emerged, and these have given rise to specific data processing approaches. However, no architecture has been proposed that goes beyond specialized object recognition toward generic object recognition.

Visually, a generic object recognition arrangement requires access to good raw visual data—preferably free of device quirks, scene quirks, user quirks, etc. Developers of systems built around object identification will best prosper and serve their users by concentrating on the object identification task at hand, and not the myriad existing roadblocks, resource sinks, and third party dependencies that currently must be confronted.

As noted, virtually all object identification techniques can make use of—or even rely upon—a pipe to “the cloud.”

“Cloud” can include anything external to the cell phone. An example is a nearby cell phone, or plural phones on a distributed network. Unused processing power on such other phone devices can be made available for hire (or for free) to call upon as needed. The cell phones of the implementations detailed herein can scavenge processing power from such other cell phones.

Such a cloud may be ad hoc, e.g., other cell phones within Bluetooth range of the user's phone. The ad hoc network can be extended by having such other phones also extend the local cloud to further phones that they can reach by Bluetooth, but the user cannot.
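The extension of an ad hoc cloud through intermediate phones can be modeled as a breadth-first search over a graph of which phones are within range of which. The sketch below is an illustrative assumption; the `links` mapping, the `max_hops` limit, and the function name are hypothetical, and no actual Bluetooth discovery is performed.

```python
from collections import deque


def reachable_phones(links, origin, max_hops):
    """Map each phone reachable from origin to its hop count.

    links maps a phone to the phones within its direct radio range.
    Phones more than max_hops relays away are excluded.
    """
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        phone = queue.popleft()
        if hops[phone] == max_hops:
            continue  # do not extend the cloud beyond the hop limit
        for neighbor in links.get(phone, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[phone] + 1
                queue.append(neighbor)
    del hops[origin]
    return hops
```

With `max_hops` of 1 the result is just the phones in the user's own range; larger values admit phones the user cannot reach directly.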

The “cloud” can also comprise other computational platforms, such as set-top boxes; processors in automobiles, thermostats, HVAC systems, wireless routers, local cell phone towers and other wireless network edges (including the processing hardware for their software-defined radio equipment), etc. Such processors can be used in conjunction with more traditional cloud computing resources—as are offered by Google, Amazon, etc.

(In view of concerns of certain users about privacy, the phone desirably has a user-configurable option indicating whether the phone can refer data to cloud resources for processing. In one arrangement, this option has a default value of “No,” limiting functionality and impairing battery life, but also limiting privacy concerns. In another arrangement, this option has a default value of “Yes.”)

Desirably, image-responsive techniques should produce a short term “result or answer,” which generally requires some level of interactivity with a user—hopefully measured in fractions of a second for truly interactive applications, or a few seconds or fractions of a minute for nearer-term “I'm patient to wait” applications.

As for the objects in question, they can break down into various categories, including (1) generic passive (clues to basic searches), (2) geographic passive (at least you know where you are, and may hook into geographic-specific resources), (3) “cloud supported” passive, as with “identified/enumerated objects” and their associated sites, and (4) active/controllable, a la ThingPipe (a reference to technology detailed in application Ser. No. 12/498,709, such as WiFi-equipped thermostats and parking meters).

An object recognition platform should not, it seems, be conceived in the classic “local device and local resources only” software mentality. However, it may be conceived as a local device optimization problem. That is, the software on the local device, and its processing hardware, should be designed in contemplation of their interaction with off-device software and hardware. Ditto the balance and interplay of control functionality, pixel crunching functionality, and application software/GUI provided on the device, versus off the device. (In many implementations, certain databases useful for object identification/recognition will reside remote from the device.)

In a particularly preferred arrangement, such a processing platform employs image processing near the sensor—optimally on the same chip, with at least some processing tasks desirably performed by dedicated, special purpose hardware.

Consider FIG. 13, which shows an architecture of a cell phone 10 in which an image sensor 12 feeds two processing paths. One, 13, is tailored for the human visual system, and includes processing such as JPEG compression. Another, 14, is tailored for object recognition. As discussed, some of this processing may be performed by the mobile device, while other processing may be referred to the cloud 16.

FIG. 14 takes an application-centric view of the object recognition processing path. Some applications reside wholly in the cell phone. Other applications reside wholly outside the cell phone—e.g., simply taking keyvector data as stimulus. More common are hybrids, such as where some processing is done in the cell phone, other processing is done externally, and the application software orchestrating the process resides in the cell phone.

To illustrate further discussion, FIG. 15 shows a range 40 of some of the different types of images 41-46 that may be captured by a particular user's cell phone. A few brief (and incomplete) comments about some of the processing that may be applied to each image are provided in the following paragraphs.

Image 41 depicts a thermostat. A steganographic digital watermark 47 is textured or printed on the thermostat's case. (The watermark is shown as visible in FIG. 15, but is typically imperceptible to the viewer.) The watermark conveys information intended for the cell phone, allowing it to present a graphic user interface by which the user can interact with the thermostat. A bar code or other data carrier can alternatively be used. Such technology is further detailed below, and in patent application Ser. No. 12/498,709, filed Apr. 14, 2009.

Image 42 depicts an item including a barcode 48. This barcode conveys Universal Product Code (UPC) data. Other barcodes may convey other information. The barcode payload is not primarily intended for reading by a user cell phone (in contrast to watermark 47), but it nonetheless may be used by the cell phone to help determine an appropriate response for the user.

Image 43 shows a product that may be identified without reference to any express machine readable information (such as a bar code or watermark). A segmentation algorithm may be applied to edge-detected image data to distinguish the apparent image subject from the apparent background. The image subject may be identified through its shape, color and texture. Image fingerprinting may be used to identify reference images having similar labels, and metadata associated with those other images may be harvested. SIFT techniques (discussed below) may be employed for such pattern-based recognition tasks. Specular reflections in low texture regions may tend to indicate the image subject is made of glass. Optical character recognition can be applied for further information (reading the visible text). All of these clues can be employed to identify the depicted item, and help determine an appropriate response for the user.

Additionally (or alternatively), similar-image search systems, such as Google Similar Images and Microsoft Live Search, can be employed to find similar images, and their metadata can then be harvested. (As of this writing, these services do not directly support upload of a user picture to find similar web pictures. However, the user can post the image to Flickr (using Flickr's cell phone upload functionality), and it will soon be found and processed by Google and Microsoft.)

Image 44 is a snapshot of friends. Facial detection and recognition may be employed (i.e., to indicate that there are faces in the image, and to identify particular faces and annotate the image with metadata accordingly, e.g., by reference to user-associated data maintained by Apple's iPhoto service, Google's Picasa service, Facebook, etc.). Some facial recognition applications can be trained for non-human faces, e.g., cats, dogs, animated characters including avatars, etc. Geolocation and date/time information from the cell phone may also provide useful information.

The persons wearing sunglasses pose a challenge for some facial recognition algorithms. Identification of those individuals may be aided by their association with persons whose identities can more easily be determined (e.g., by conventional facial recognition). That is, by identifying other group pictures in iPhoto/Picasa/Facebook/etc. that include one or more of the latter individuals, the other individuals depicted in such photographs may also be present in the subject image. These candidate persons form a much smaller universe of possibilities than is normally provided by unbounded iPhoto/Picasa/Facebook/etc. data. The facial vectors discernable from the sunglass-wearing faces in the subject image can then be compared against this smaller universe of possibilities in order to determine a best match. If, in the usual case of recognizing a face, a score of 90 is required to be considered a match (out of an arbitrary top match score of 100), in searching such a group-constrained set of images a score of 70 or 80 might suffice. (Where, as in image 44, there are two persons depicted without sunglasses, the occurrence of both of these individuals in a photo with one or more other individuals may increase its relevance to such an analysis, implemented, e.g., by increasing a weighting factor in a matching algorithm.)
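The relaxed threshold for a group-constrained search can be illustrated with a minimal Python sketch. The function name, the score dictionary, and the choice of 80 as the constrained threshold are assumptions made for illustration; the text says only that 70 or 80 might suffice where 90 is normally required.

```python
from typing import Dict, Optional, Set

# Thresholds from the text: 90 for an unconstrained match; a lower value
# (80 chosen here; the text suggests 70 or 80) for a group-constrained set.
UNCONSTRAINED_THRESHOLD = 90
CONSTRAINED_THRESHOLD = 80


def best_match(scores: Dict[str, float],
               candidates: Optional[Set[str]] = None) -> Optional[str]:
    """Return the best-scoring identity that clears the applicable threshold.

    scores: hypothetical match scores (0-100) of one face against known people.
    candidates: people who co-occur in other photos with already-identified
        persons; when given, only they are considered and the bar is lowered.
    """
    if candidates is None:
        pool, threshold = scores, UNCONSTRAINED_THRESHOLD
    else:
        pool = {name: s for name, s in scores.items() if name in candidates}
        threshold = CONSTRAINED_THRESHOLD
    if not pool:
        return None
    name = max(pool, key=pool.get)
    return name if pool[name] >= threshold else None
```

With scores of 84 for one person and 86 for another, the unconstrained search finds no match, while a candidate set containing only the first person yields that person as the match.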

Image 45 shows part of the statue of Prometheus in Rockefeller Center, NY. Its identification can follow teachings detailed elsewhere in this specification.

Image 46 is a landscape, depicting the Maroon Bells mountain range in Colorado. This image subject may be recognized by reference to geolocation data from the cell phone, in conjunction with geographic information services such as GeoNames or Yahoo!'s GeoPlanet.

(It should be understood that techniques noted above in connection with processing of one of the images 41-46 in FIG. 15 can likewise be applied to others of the images. Moreover, it should be understood that while in some respects the depicted images are ordered according to ease of identifying the subject and formulating a response, in other respects they are not. For example, although landscape image 46 is depicted to the far right, its geolocation data is strongly correlated with the metadata “Maroon Bells.” Thus, this particular image presents an easier case than that presented by many other images.)

In one embodiment, such processing of imagery occurs automatically, without express user instruction each time. Subject to network connectivity and power constraints, information can be gleaned continuously from such processing, and may be used in processing subsequently-captured images. For example, an earlier image in a sequence that includes photograph 44 may show members of the depicted group without sunglasses, simplifying identification of the persons later depicted with sunglasses.

FIG. 16, Etc., Implementation

FIG. 16 gets into the nitty-gritty of a particular implementation, incorporating certain of the features earlier discussed. (The other discussed features can be implemented by the artisan within this architecture, based on the provided disclosure.) In this data driven arrangement 30, operation of a cell phone camera 32 is dynamically controlled in accordance with packet data sent by a setup module 34, which in turn is controlled by a control processor module 36. (Control processor module 36 may be the cell phone's primary processor, or an auxiliary processor, or this function may be distributed.) The packet data further specifies operations to be performed by an ensuing chain of processing stages 38.

In one particular implementation, setup module 34 dictates, on a frame-by-frame basis, the parameters that are to be employed by camera 32 in gathering an exposure. Setup module 34 also specifies the type of data the camera is to output. These instructional parameters are conveyed in a first field 55 of a header portion 56 of a data packet 57 corresponding to that frame (FIG. 17).

For example, for each frame, the setup module 34 may issue a packet 57 whose first field 55 instructs the camera about, e.g., the length of the exposure, the aperture size, the lens focus, the depth of field, etc. Module 34 may further author the field 55 to specify that the sensor is to sum sensor charges to reduce resolution (e.g., producing a frame of 640×480 data from a sensor capable of 1280×960), output data only from red-filtered sensor cells, output data only from a horizontal line of cells across the middle of the sensor, output data only from a 128×128 patch of cells in the center of the pixel array, etc. The camera instruction field 55 may further specify the exact time that the camera is to capture data, so as to allow, e.g., desired synchronization with ambient lighting (as detailed later).
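Two of the output modes just listed, charge summing and center-patch readout, can be mimicked in software with a short Python sketch. This is only an illustration operating on a list-of-lists frame; an actual sensor would sum charges in the analog domain, and the function names here are hypothetical.

```python
from typing import List

Frame = List[List[int]]


def bin_2x2(frame: Frame) -> Frame:
    """Sum each 2x2 block of values, halving resolution in each dimension
    (e.g., 1280x960 in, 640x480 out)."""
    rows, cols = len(frame), len(frame[0])
    if rows % 2 or cols % 2:
        raise ValueError("frame dimensions must be even")
    return [
        [frame[r][c] + frame[r][c + 1] + frame[r + 1][c] + frame[r + 1][c + 1]
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def center_patch(frame: Frame, size: int) -> Frame:
    """Return the size x size patch at the center of the frame."""
    rows, cols = len(frame), len(frame[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]
```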

Each packet 57 issued by setup module 34 may include different camera parameters in the first header field 55. Thus, a first packet may cause camera 32 to capture a full frame image with an exposure time of 1 millisecond. A next packet may cause the camera to capture a full frame image with an exposure time of 10 milliseconds, and a third may dictate an exposure time of 100 milliseconds. (Such frames may later be processed in combination to yield a high dynamic range image.) A fourth packet may instruct the camera to down-sample data from the image sensor, and combine signals from differently color-filtered sensor cells, so as to output a 4×3 array of grayscale luminance values. A fifth packet may instruct the camera to output data only from an 8×8 patch of pixels at the center of the frame. A sixth packet may instruct the camera to output only five lines of image data, from the top, bottom, middle, mid-upper and mid-lower rows of the sensor. A seventh packet may instruct the camera to output only data from blue-filtered sensor cells. An eighth packet may instruct the camera to disregard any auto-focus instructions but instead capture a full frame at infinity focus. And so on.

Each such packet 57 is provided from setup module 34 across a bus or other data channel 60 to a camera controller module associated with the camera. (The details of a digital camera, including an array of photosensor cells, associated analog-digital converters and control circuitry, etc., are well known to artisans and so are not belabored.) Camera 32 captures digital image data in accordance with instructions in the header field 55 of the packet and stuffs the resulting image data into a body 59 of the packet. It also deletes the camera instructions 55 from the packet header (or otherwise marks header field 55 in a manner permitting it to be disregarded by subsequent processing stages).

When the packet 57 was authored by setup module 34 it also included a series of further header fields 58, each specifying how a corresponding, successive post-sensor stage 38 should process the captured data. As shown in FIG. 16, there are several such post-sensor processing stages 38.

Camera 32 outputs the image-stuffed packet produced by the camera (a pixel packet) onto a bus or other data channel 61, which conveys it to a first processing stage 38.

Stage 38 examines the header of the packet. Since the camera deleted the instruction field 55 that conveyed camera instructions (or marked it to be disregarded), the first header field encountered by a control portion of stage 38 is field 58 a. This field details parameters of an operation to be applied by stage 38 to data in the body of the packet.

For example, field 58 a may specify parameters of an edge detection algorithm to be applied by stage 38 to the packet's image data (or simply that such an algorithm should be applied). It may further specify that stage 38 is to substitute the resulting edge-detected set of data for the original image data in the body of the packet. (Substituting of data, rather than appending, may be indicated by the value of a single bit flag in the packet header.) Stage 38 performs the requested operation (which may involve configuring programmable hardware in certain implementations). First stage 38 then deletes instructions 58 a from the packet header 56 (or marks them to be disregarded) and outputs the processed pixel packet for action by a next processing stage.

A control portion of a next processing stage (which here comprises stages 38 a and 38 b, discussed later) examines the header of the packet. Since field 58 a was deleted (or marked to be disregarded), the first field encountered is field 58 b. In this particular packet, field 58 b may instruct the second stage not to perform any processing on the data in the body of the packet, but instead simply delete field 58 b from the packet header and pass the pixel packet to the next stage.

A next field of the packet header may instruct the third stage 38 c to perform 2D FFT operations on the image data found in the packet body, based on 16×16 blocks. It may further direct the stage to hand off the processed FFT data to a wireless interface, for internet transmission to address 216.239.32.10, accompanied by specified data (detailing, e.g., the task to be performed on the received FFT data by the computer at that address, such as texture classification). It may further direct the stage to hand off a single 16×16 block of FFT data, corresponding to the center of the captured image, to the same or a different wireless interface for transmission to address 12.232.235.27, again accompanied by corresponding instructions about its use (e.g., search an archive of stored FFT data for a match, and return information if a match is found; also, store this 16×16 block in the archive with an associated identifier). Finally, the header authored by setup module 34 may instruct stage 38 c to replace the body of the packet with the single 16×16 block of FFT data dispatched to the wireless interface. As before, the stage also edits the packet header to delete (or mark) the instructions to which it responded, so that a header instruction field for the next processing stage is the first to be encountered.

In other arrangements, the addresses of the remote computers are not hard-coded. For example, the packet may include a pointer to a database record or memory location (in the phone or in the cloud), which contains the destination address. Or, stage 38 c may be directed to hand off the processed pixel packet to the Query Router and Response Manager (e.g., FIG. 7). This module examines the pixel packet to determine what type of processing is next required, and it routes it to an appropriate provider (which may be in the cell phone if resources permit, or in the cloud, whether among the stable of static providers or a provider identified through an auction). The provider returns the requested output data (e.g., texture classification information, and information about any matching FFT in the archive), and processing continues per the next item of instruction in the pixel packet header.

The data flow continues through as many functions as a particular operation may require.

In the particular arrangement illustrated, each processing stage 38 strips out, from the packet header, the instructions on which it acted. The instructions are ordered in the header in the sequence of processing stages, so this removal allows each stage to look to the first instructions remaining in the header for direction. Other arrangements, of course, can alternatively be employed. (For example, a module may insert new information into the header, at the front, tail, or elsewhere in the sequence, based on processing results. This amended header then controls packet flow and therefore processing.)
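The header-stripping flow described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the disclosed hardware: the `Instruction` and `Packet` classes, the `"pass"` operation name, and the dictionary of operations are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Instruction:
    op: str                                   # name of the operation to run
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Packet:
    header: List[Instruction]                 # ordered per-stage instructions
    body: Any = None                          # image-related data


Operation = Callable[[Any, Dict[str, Any]], Any]


def run_stage(packet: Packet, operations: Dict[str, Operation]) -> Packet:
    """Act on the first remaining header instruction, then strip it."""
    instruction = packet.header.pop(0)
    if instruction.op != "pass":              # "pass": forward body untouched
        packet.body = operations[instruction.op](packet.body, instruction.params)
    return packet


def run_pipeline(packet: Packet, operations: Dict[str, Operation]) -> Packet:
    """Pass the packet from stage to stage until the header is exhausted."""
    while packet.header:
        run_stage(packet, operations)
    return packet
```

Because each stage removes the instruction it consumed, the next stage always finds its own instruction at the front of the header, which is the property the text relies on.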

In addition to outputting data for the next stage, each stage 38 may further have an output 31 providing data back to the control processor module 36. For example, processing undertaken by one of the local stages 38 may indicate that the exposure or focus of the camera should be adjusted to optimize suitability of an upcoming frame of captured data for a particular type of processing (e.g., object identification). This focus/exposure information can be used as predictive setup data for the camera the next time a frame of the same or similar type is captured. The control processor module 36 can set up a frame request using a filtered or time-series prediction sequence of focus information from previous frames, or a sub-set of those frames.

Error and status reporting functions may also be accomplished using outputs 31. Each stage may also have one or more other outputs 33 for providing data to other processes or modules, whether locally within the cell phone or remote (“in the cloud”). Data (in packet form, or in other format) may be directed to such outputs in accordance with instructions in packet 57, or otherwise.

For example, a processing module 38 may make a data flow selection based on some result of processing it performs. E.g., if an edge detection stage discerns a sharp contrast image, then an outgoing packet may be routed to an external service provider for FFT processing. That provider may return the resultant FFT data to other stages. However, if the image has poor edges (such as being out of focus), then the system may not want FFT (and following processing) to be performed on the data. Thus, the processing stages can cause branches in the data flow, dependent on parameters of the processing (such as discerned image characteristics).

Instructions specifying such conditional branching can be included in the header of packet 57, or they can be provided otherwise. FIG. 19 shows one arrangement. Instructions 58 d originally in packet 57 specify a condition, and specify a location in a memory 79 from which replacement subsequent instructions (58 e′-58 g′) can be read, and substituted into the packet header, if the condition is met. If the condition is not met, execution proceeds in accordance with header instructions already in the packet.
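A minimal Python sketch of this kind of conditional substitution follows. The representation is an assumption made for illustration: the branch instruction is modeled as a dictionary holding a condition callable and a memory address, and the memory as a plain dictionary keyed by address.

```python
from typing import Any, Dict, List


def resolve_branch(header: List[Any], body: Any,
                   memory: Dict[str, List[Any]]) -> List[Any]:
    """Consume the branch instruction at the front of the header.

    If its condition holds for the packet body, the remaining instructions
    are replaced by those stored at the named memory location; otherwise the
    instructions already in the packet are kept.
    """
    branch, remaining = header[0], header[1:]
    if branch["condition"](body):
        return list(memory[branch["address"]])   # copy; leave memory intact
    return remaining
```

For instance, a condition testing whether measured edge strength exceeds some level could swap in FFT-related instructions for sharp images, and leave the original instructions in place for blurred ones.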

In other arrangements, other variations can be employed. For example, all of the possible conditional instructions can be provided in the packet. In another arrangement, a packet architecture is still used, but one or more of the header fields do not include explicit instructions. Rather, they simply point to memory locations from which corresponding instructions (or data) are retrieved, e.g., by the corresponding processing stage 38.

Memory 79 (which can include a cloud component) can also facilitate adaptation of processing flow even if conditional branching is not employed. For example, a processing stage may yield output data that determines parameters of a filter or other algorithm to be applied by a later stage (e.g., a convolution kernel, a time delay, a pixel mask, etc.). Such parameters may be identified by the former processing stage in memory (e.g., determined/calculated, and stored), and recalled for use by the later stage. In FIG. 19, for example, processing stage 38 produces parameters that are stored in memory 79. A subsequent processing stage 38 c later retrieves these parameters, and uses them in execution of its assigned operation. (The information in memory can be labeled to identify the module/provider from which it originated, or to which it is destined, if known, or other addressing arrangements can be used.) Thus, the processing flow can adapt to circumstances and parameters that were not known at the time control processor module 36 originally directed setup module 34 to author packet 57.

In one particular embodiment, each of the processing stages 38 comprises hardware circuitry dedicated to a particular task. The first stage 38 may be a dedicated edge-detection processor. The third stage 38 c may be a dedicated FFT processor. Other stages may be dedicated to other processes. These may include DCT, wavelet, Haar, Hough, and Fourier-Mellin transform processors, filters of different sorts (e.g., Wiener, low pass, bandpass, highpass), and stages for performing all or part of operations such as facial recognition, optical character recognition, computation of eigenvalues, extraction of shape, color and texture feature data, barcode decoding, watermark decoding, object segmentation, pattern recognition, age and gender detection, emotion classification, orientation determination, compression, decompression, log-polar mapping, convolution, interpolation, decimation/down-sampling/anti-aliasing, correlation, performing square-root and squaring operations, array multiplication, perspective transformation, butterfly operations (combining results of smaller DFTs into a larger DFT, or decomposing a larger DFT into subtransforms), etc.

These hardware processors can be field-configurable, instead of dedicated. Thus, each of the processing blocks in FIG. 16 may be dynamically reconfigurable, as circumstances warrant. At one instant a block may be configured as an FFT processing module. The next instant it may be configured as a filter stage, etc. One moment the hardware processing chain may be configured as a barcode reader; the next it may be configured as a facial recognition system, etc.

Such hardware reconfiguration information can be downloaded from the cloud, or from services such as the Apple AppStore. And the information needn't be statically resident on the phone once downloaded; it can be summoned from the cloud/AppStore whenever needed.

Given increasing broadband availability and speed, the hardware reconfiguration data can be downloaded to the cell phone each time it is turned on, or otherwise initialized, or whenever a particular function is initialized. Gone would be the dilemma of dozens of different versions of an application being deployed in the market at any given time (depending on when different users last downloaded updates), and the conundrums that companies confront in supporting disparate versions of products in the field. Each time a device or application is initialized, the latest version of all or selected functionalities is downloaded to the phone. And this works not just for full system functionality, but also components, such as hardware drivers, software for hardware layers, etc. At each initialization, hardware is configured anew, with the latest version of applicable instructions. (For code used during initializing, it can be downloaded for use at the next initialization.) Some updated code may be downloaded and dynamically loaded only when particular applications require it, such as to configure the hardware of FIG. 6 for specialized functions. The instructions can also be tailored to the particular platforms, e.g., the iPhone device may employ different accelerometers than the Android device, and application instructions may be varied accordingly.

In some embodiments, the respective purpose processors may be chained in a fixed order. The edge detection processor may be first, the FFT processor may be third, and so on.

Alternatively, the processing modules may be interconnected by one or more busses (and/or a crossbar arrangement or other interconnection architecture) that permit any stage to receive data from any stage, and to output data to any stage. Another interconnect method is a network on a chip (effectively a packet-based LAN; similar to crossbar in adaptability, but programmable by network protocols). Such arrangements can also support having one or more stages iteratively process data, taking output as input, to perform further processing.

One iterative processing arrangement is shown by stages 38 a/38 b in FIG. 16. Output from stage 38 a can be taken as input to stage 38 b. Stage 38 b can be instructed to do no processing on the data, but simply apply it again back to the input of stage 38 a. This can loop as many times as desired. When iterative processing by stage 38 a is completed, its output can be passed to a next stage 38 c in the chain.

In addition to simply serving as a pass-through stage, stage 38 b can perform its own type of processing on the data processed by stage 38 a. Its output can be applied to the input of stage 38 a. Stage 38 a can be instructed to apply, again, its process to the data produced by stage 38 b, or to pass it through. Any serial combination of stage 38 a/38 b processing can thus be achieved.

The roles of stages 38 a and 38 b in the foregoing can also be reversed.

In this fashion, stages 38 a and 38 b can be operated to (1) apply a stage 38 a process one or more times to data; (2) apply a stage 38 b process one or more times to data; (3) apply any combination and order of 38 a and 38 b processes to data; or (4) simply pass the input data to the next stage, without processing.

The camera stage can be incorporated into an iterative processing loop. For example, to gain focus-lock, a packet may be passed from the camera to a processing module that assesses focus. (Examples may include an FFT stage, looking for high frequency image components; an edge detector stage, looking for strong edges; etc. Sample edge detection algorithms include Canny, Sobel, and differential. Edge detection is also useful for object tracking.) An output from such a processing module can loop back to the camera's controller module and vary a focus signal. The camera captures a subsequent frame with the varied focus signal, and the resulting image is again provided to the processing module that assesses focus. This loop continues until the processing module reports focus within a threshold range is achieved. (The packet header, or a parameter in memory, can specify an iteration limit, e.g., specifying that the iterating should terminate and output an error signal if no focus meeting the specified requirement is met within ten iterations.)
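The focus-lock loop with an iteration limit can be sketched in Python as below. The search strategy (step in one direction, then reverse with a halved step when sharpness stops improving) is an assumption chosen for illustration; the text does not specify how the focus signal is varied. The `capture` and `sharpness` callables stand in for the camera and the focus-assessing stage.

```python
from typing import Any, Callable, Optional


def focus_lock(capture: Callable[[float], Any],
               sharpness: Callable[[Any], float],
               focus: float,
               target: float,
               step: float = 1.0,
               max_iterations: int = 10) -> Optional[float]:
    """Vary the focus setting until sharpness reaches `target`.

    Returns the focus setting achieved, or None (standing in for the error
    signal) if the target is not met within `max_iterations` adjustments.
    """
    best = sharpness(capture(focus))
    for _ in range(max_iterations):
        if best >= target:
            return focus
        trial = focus + step
        score = sharpness(capture(trial))
        if score > best:
            focus, best = trial, score        # improvement: keep going
        else:
            step = -step / 2                  # overshoot: reverse, finer step
    return focus if best >= target else None
```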

While the discussion has focused on serial data processing, image or other data may be processed in two or more parallel paths. For example, the output of stage 38 d may be applied to two subsequent stages, each of which starts a respective branch of a fork in the processing. Those two chains can be processed independently thereafter, or data resulting from such processing can be combined, or used in conjunction, in a subsequent stage. (Each of those processing chains, in turn, can be forked, etc.)

As noted, a fork commonly will appear much earlier in the chain. That is, in most implementations, a parallel processing chain will be employed to produce imagery for human (as opposed to machine) consumption. Thus, a parallel process may fork immediately following the camera sensor 12, as shown by juncture 17 in FIG. 13. The processing for the human visual system 13 includes operations such as noise reduction, white balance, and compression. Processing for object identification 14, in contrast, may include the operations detailed in this specification.

When an architecture involves forked or other parallel processes, the different modules may finish their processing at different times. They may output data as they finish, asynchronously, as the pipeline or other interconnection network permits. When the pipeline/network is free, a next module can transfer its completed results. Flow control may involve some arbitration, such as giving one path or data a higher priority. Packets may convey priority data, determining their precedence in case arbitration is needed. For example, many image processing operations/modules make use of Fourier domain data, such as produced by an FFT module. The output from an FFT module may thus be given a high priority, and precedence over others in arbitrating data traffic, so that the Fourier data that may be needed by other modules can be made available with a minimum of delay.
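Priority-based arbitration of this sort can be sketched with a heap-backed queue in Python. The `Arbiter` class, its method names, and the numeric priority convention (larger means more urgent, ties served in arrival order) are assumptions for illustration only.

```python
import heapq
import itertools
from typing import Any, List, Optional, Tuple


class Arbiter:
    """Grants the shared channel to the highest-priority pending packet."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, Any]] = []
        self._arrival = itertools.count()     # tie-breaker: arrival order

    def submit(self, priority: int, source: str, payload: Any) -> None:
        # heapq is a min-heap, so the priority is negated to pop largest first.
        heapq.heappush(self._heap,
                       (-priority, next(self._arrival), source, payload))

    def next_transfer(self) -> Optional[Tuple[str, Any]]:
        """Return (source, payload) of the next packet, or None if idle."""
        if not self._heap:
            return None
        _, _, source, payload = heapq.heappop(self._heap)
        return source, payload
```

In this sketch, output from an FFT module submitted with a higher priority value is transferred ahead of results that finished earlier but carry lower priority.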

In other implementations, some or all of the processing stages are not dedicated purpose processors, but rather are general purpose microprocessors programmed by software. In still other implementations, the processors are hardware-reconfigurable. For example, some or all may be field programmable gate arrays, such as Xilinx Virtex series devices. Alternatively they may be digital signal processing cores, such as Texas Instruments TMS320 series devices.

Other implementations can include PicoChip devices, such as the PC302 and PC312 multicore DSPs. Their programming model allows each core to be coded independently (e.g., in C), and then to communicate with others over an internal interconnect mesh. The associated tools particularly lend themselves to use of such processors in cellular equipment.

Still other implementations may employ configurable logic on an ASIC. For example, a processor can include a region of configuration logic, mixed with dedicated logic. This allows configurable logic in a pipeline, with dedicated pipeline or bus interface circuitry.

An implementation can also include one or more modules with a small CPU and RAM, with programmable code space for firmware, and workspace for processing, essentially a dedicated core. Such a module can perform fairly extensive computations, configurable as needed by the process that is using the hardware at the time.

All such devices can be deployed in a bus, crossbar or other interconnection architecture that again permits any stage to receive data from, and output data to, any stage. (An FFT or other transform processor implemented in this fashion may be reconfigured dynamically to process blocks of 16×16, 64×64, 4096×4096, 1×64, 32×128, etc.) In certain implementations, some processing modules are replicated, permitting parallel execution on parallel hardware. For example, several FFTs may be processing simultaneously.

In a variant arrangement, a packet conveys instructions that serve to reconfigure hardware of one or more of the processing modules. As a packet enters a module, the header causes the module to reconfigure the hardware before the image-related data is accepted for processing. The architecture is thus configured on the fly by packets (which may convey image related data, or not). The packets can similarly convey firmware to be loaded into a module having a CPU core, or into an application- or cloud-based layer; likewise with software instructions.

The module configuration instructions may be received over a wireless or other external network; they needn't always be resident on the local system. If the user requests an operation for which local instructions are not available, the system can request the configuration data from a remote source.

Instead of conveying the configuration data/instructions themselves, the packet may simply convey an index number, pointer, or other address information. This information can be used by the processing module to access a corresponding memory store from which the needed data/instructions can be retrieved. Like a cache, if the local memory store is not found to contain the needed data/instructions, they can then be requested from another source (e.g., across an external network).
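The cache-like lookup just described can be sketched in a few lines of Python. The function and parameter names are hypothetical; `fetch_remote` stands in for whatever request is made across an external network.

```python
from typing import Any, Callable, Dict


def fetch_configuration(index: int,
                        local_store: Dict[int, Any],
                        fetch_remote: Callable[[int], Any]) -> Any:
    """Resolve a packet's index to configuration data.

    The local memory store is consulted first; on a miss, the data is
    requested from the remote source and retained locally for later reuse.
    """
    if index not in local_store:
        local_store[index] = fetch_remote(index)
    return local_store[index]
```

Retaining the fetched data locally is a design choice of this sketch; an implementation could equally summon the data afresh on each use.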

Such arrangements bring the dynamic of routability down to the hardware layer, configuring the module as data arrives at it.

Parallelism is widely employed in graphics processing units (GPUs). Many computer systems employ GPUs as auxiliary processors to handle operations such as graphics rendering. Cell phones increasingly include GPU chips to allow the phones to serve as gaming platforms; these can be employed to advantage in implementations of the present technology. (By way of example and not limitation, a GPU can be used to perform bilinear and bicubic interpolation, projective transformations, filtering, etc.)

In accordance with another aspect of the present technology, a GPU is used to correct for lens aberrations and other optical distortion.

Cell phone cameras often display optical non-linearities, such as barrel distortion, focus anomalies at the perimeter, etc. This is particularly a problem when decoding digital watermark information from captured imagery. With a GPU, the image can be treated as a texture map, and applied to a correction surface.

Typically, texture mapping is used to put a picture of bricks or a stone wall onto a surface, e.g., of a dungeon. Texture memory data is referenced, and mapped onto a plane or polygon as it is drawn. In the present context it is the image that is applied to a surface. The surface is shaped so that the image is drawn with an arbitrary, correcting transform.

Steganographic calibration signals in a digitally watermarked image can be used to discern the distortion by which an image has been transformed. (See, e.g., Digimarc's U.S. Pat. No. 6,590,996.) Each patch of a watermarked image can be characterized by affine transformation parameters, such as translation and scale. An error function for each location in the captured frame can thereby be derived. From this error information, a corresponding surface can be devised which, when the distorted image is projected onto it by the GPU, causes the image to appear in its counter-distorted, original form.

A lens can be characterized in this fashion with a reference watermark image. Once the associated correction surface has been devised, it can be re-used with other imagery captured through that optical system (since the associated distortion is fixed). Other imagery can be projected onto this correction surface by the GPU to correct the lens distortion. (Different focal depths, and apertures, may require characterization of different correction functions, since the optical path through the lens may be different.)

When a new image is captured, it can be initially rectilinearized, to rid it of keystone/trapezoidal perspective effect. Once rectilinearized (e.g., re-squared relative to the camera lens), the local distortions can be corrected by mapping the rectilinearized image onto the correction surface, using the GPU.

Thus, the correction model is in essence a polygon surface, where the tilts and elevations correspond to focus irregularities. Each region of the image has a local transform matrix allowing for correction of that piece of the image.

The same arrangement can likewise be used to correct distortion of a lens in an image projection system. Before projection, the image is mapped, like a texture, onto a correction surface synthesized to counteract lens distortion. When the thus-processed image is projected through the lens, the lens distortion counteracts the correction surface distortion earlier applied, causing a corrected image to be projected from the system.

Reference was made to the depth of field as one of the parameters that can be employed by camera 32 in gathering exposures. Although a lens can precisely focus at only one distance, the decrease in sharpness is gradual on either side of the focused distance. (The depth of field depends on the point spread function of the optics, including the lens focal length and aperture.) As long as the captured pixels yield information useful for the intended operation, they need not be in perfect focus.

Sometimes focusing algorithms hunt for, but fail to achieve, focus, wasting cycles and battery life. Better, in some instances, is to simply grab frames at a series of different focus settings. A search tree of focus depths, or depths of field, may be used. This is particularly useful where an image may include multiple subjects of potential interest, each at a different plane. The system may capture a frame focused at 6 inches and another at 24 inches. The different frames may reveal that there are two objects of interest within the field of view: one better captured in one frame, the other better captured in the other. Or the 24 inch-focused frame may be found to have no useful data, but the 6 inch-focused frame may include enough discriminatory frequency content to see that there are two or more subject image planes. Based on the frequency content, one or more frames with other focus settings may then be captured. Or a region in the 24 inch-focused frame may have one set of Fourier attributes, and the same region in the 6 inch-focused frame may have a different set of Fourier attributes, and from the difference between the two frames a next trial focus setting may be identified (e.g., at 10 inches), and a further frame at that focus setting may be captured. Feedback is applied, not necessarily to obtain perfect focus lock, but in accordance with search criteria to make decisions about further captures that may reveal additional useful detail. The search may fork and branch, depending on the number of subjects discerned, and associated Fourier, etc., information, until satisfactory information about all subjects has been gathered.

A related approach is to capture and buffer plural frames as a camera lens system is undergoing adjustment to an intended focus setting. Analysis of the frame finally captured at the intended focus may suggest that intermediate focus frames would reveal useful information, e.g., about subjects not earlier apparent or significant. One or more of the frames earlier captured and buffered can then be recalled and processed to provide information whose significance was not earlier recognized.

Camera control can also be responsive to spatial coordinate information. By using geolocation data, and orientation (e.g., by magnetometer), the camera can check that it is capturing an intended target. The camera set-up module may request images of not just certain exposure parameters, but also of certain subjects, or locations. When a camera is in the correct position to capture a specific subject (which may have been previously user-specified, or identified by a computer process), one or more frames of image data automatically can be captured. (In some arrangements, the orientation of the camera is controlled by stepper motors or other electromechanical arrangements, so that the camera can autonomously set the azimuth and elevation to capture image data from a particular direction, to capture a desired subject. Electronic or fluid steering of the lens direction can also be utilized.)

As noted, the camera setup module may instruct the camera to capture a sequence of frames. In addition to benefits such as synthesizing high dynamic range imagery, such frames can also be aligned and combined to obtain super-resolution images. (As is known in the art, super-resolution can be achieved by diverse methods. For example, the frequency content of the images can be analyzed, related to each other by linear transform, affine-transformed to correct alignment, then overlaid and combined. In addition to other applications, this can be used in decoding digital watermark data from imagery. If the subject is too far from the camera to obtain satisfactory image resolution normally, the resolution may be doubled by such super-resolution techniques to obtain the higher resolution needed for successful watermark decoding.)

In the exemplary embodiment, each processing stage substituted the results of its processing for the input data contained in the packet when received. In other arrangements, the processed data can be added to the packet body, while maintaining the data originally present. In such case the packet grows during processing, as more information is added. While this may be disadvantageous in some contexts, it can also provide advantages. For example, it may obviate the need to fork a processing chain into two packets or two threads. Sometimes both the original and the processed data are useful to a subsequent stage. For example, an FFT stage may add frequency domain information to a packet containing original pixel domain imagery. Both of these may be used by a subsequent stage, e.g., in performing sub-pixel alignment for super-resolution processing. Likewise, a focus metric may be extracted from imagery and used, with accompanying image data, by a subsequent stage.
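The two packet-handling policies just described (substituting results for the input, versus adding results to a growing packet body) can be contrasted in a short sketch. The `Packet` class, the `run_stage` helper, and the toy focus metric are hypothetical names introduced only for illustration; the actual packet layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Packet:
    header: Dict[str, Any] = field(default_factory=dict)
    body: Dict[str, Any] = field(default_factory=dict)


def run_stage(packet: Packet, name: str,
              stage: Callable[[Dict[str, Any]], Any],
              keep_inputs: bool = True) -> Packet:
    """Run one processing stage on the packet body."""
    result = stage(packet.body)
    if keep_inputs:
        packet.body[name] = result      # packet grows; earlier data is retained
    else:
        packet.body = {name: result}    # result is substituted for the input
    return packet


def toy_focus_metric(body: Dict[str, Any]) -> int:
    """Sum of absolute differences between neighboring pixel values."""
    pixels = body["pixels"]
    return sum(abs(a - b) for a, b in zip(pixels, pixels[1:]))
```

With `keep_inputs=True`, a later stage can read both `body["pixels"]` and `body["focus_metric"]` from one packet, so no fork of the processing chain is needed.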

It will be recognized that the detailed arrangements can be used to control the camera to generate different types of image data on a per-frame basis, and to control subsequent stages of the system to process each such frame differently. Thus, the system may capture a first frame under conditions selected to optimize green watermark detection, capture a second frame under conditions selected to optimize barcode reading, capture a third frame under conditions selected to optimize facial recognition, etc. Subsequent stages may be directed to process each of these frames differently, in order to best extract the data sought. All of the frames may be processed to sense illumination variations. Every other frame may be processed to assess focus, e.g., by computing 16×16 pixel FFTs at nine different locations within the image frame. (Or there may be a fork that allows all frames to be assessed for focus, and the focus branch may be disabled when not needed, or reconfigured to serve another purpose.) Etc., etc.
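A per-frame schedule of this kind might be sketched as follows. The purpose labels, the `CAPTURE_PLAN` list, and the `plan_frames` function are illustrative assumptions; only the pattern (rotating capture conditions, illumination sensing on every frame, focus assessment on every other frame) comes from the text.

```python
from itertools import cycle, islice
from typing import List, Tuple

# Hypothetical rotation of capture conditions, one entry per frame.
CAPTURE_PLAN = ["watermark", "barcode", "face"]


def plan_frames(count: int) -> List[Tuple[int, List[str]]]:
    """Return (frame index, processing tasks) for the next `count` frames."""
    plans = []
    for index, purpose in enumerate(islice(cycle(CAPTURE_PLAN), count)):
        tasks = [purpose, "illumination"]   # illumination is sensed on all frames
        if index % 2 == 0:
            tasks.append("focus")           # focus is assessed on every other frame
        plans.append((index, tasks))
    return plans
```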

In some implementations, frame capture can be tuned to capture the steganographic calibration signals present in a digital watermark signal, without regard to successful decoding of the watermark payload data itself. For example, captured image data can be at a lower resolution: sufficient to discern the calibration signals, but insufficient to discern the payload. Or the camera can expose the image without regard to human perception, e.g., overexposing so image highlights are washed-out, or underexposing so other parts of the image are indistinguishable. Yet such an exposure may be adequate to capture the watermark orientation signal. (Feedback can of course be employed to capture one or more subsequent image frames, redressing one or more shortcomings of a previous image frame.)

Some digital watermarks are embedded in specific color channels (e.g., blue), rather than across colors as modulation of image luminance (see, e.g., commonly-owned patent application Ser. No. 12/337,029 to Reed, now published as 20100150434). In capturing a frame including such a watermark, exposure can be selected to yield maximum dynamic range in the blue channel (e.g., 0-255 in an 8-bit sensor), without regard to exposure of other colors in the image. One frame may be captured to maximize dynamic range of one color, such as blue, and a later frame may be captured to maximize dynamic range of another color channel, such as yellow (i.e., along the red-green axis). These frames may then be aligned, and the blue-yellow difference determined. The frames may have wholly different exposure times, depending on lighting, subject, etc.
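As a rough illustration of the final step, the sketch below computes a per-pixel blue-yellow difference from two already-aligned frames. It assumes, purely for illustration, that the yellow value is the mean of the red and green samples, and it ignores any normalization for differing exposure times; the function name and pixel layout are likewise hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), 8 bits per channel


def blue_yellow_difference(blue_frame: List[List[Pixel]],
                           yellow_frame: List[List[Pixel]]) -> List[List[float]]:
    """Per-pixel blue minus yellow for two aligned, same-sized frames.

    Blue is read from the frame exposed for blue; yellow is taken as the
    red/green mean of the frame exposed for yellow (an assumption).
    """
    difference = []
    for blue_row, yellow_row in zip(blue_frame, yellow_frame):
        row = []
        for (_, _, blue), (red, green, _) in zip(blue_row, yellow_row):
            row.append(blue - (red + green) / 2.0)
        difference.append(row)
    return difference
```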

Desirably, the system has an operational mode in which it captures and processes imagery even when the user is not intending to “snap” a picture. If the user pushes a shutter button, the otherwise-scheduled image capture/processing operations may be suspended, and a consumer photo taking mode can take precedence. In this mode, capture parameters and processes designed to enhance human visual system aspects of the image can be employed instead.

(It will be recognized that the particular embodiment shown in FIG. 16 generates packets before any image data is collected. In contrast, FIG. 10A and associated discussion do not refer to packets existing before the camera. Either arrangement can be used in either embodiment. That is, in FIG. 10A, packets may be established prior to the capture of image data by the camera, in which case the visual keyvector processing and packaging module serves to insert the pixel data (or more typically, sub-sets or super-sets of the pixel data) into earlier-formed packets. Similarly, in FIG. 16, the packets need not be created until after the camera has captured image data.)

As noted earlier, one or more of the processing stages can be remote from the cell phone. One or more pixel packets can be routed to the cloud (or through the cloud) for processing. The results can be returned to the cell phone, or forwarded to another cloud processing stage (or both). Once back at the cell phone, one or more further local operations may be performed. Data may then be sent back out to the cloud, etc. Processing can thus alternate between the cell phone and the cloud. Eventually, result data is usually presented to the user back at the cell phone.

Applicants expect that different vendors will offer competing cloud services for specialized processing tasks. For example, Apple, Google and Facebook may each offer cloud-based facial recognition services. A user device would transmit a packet of processed data for processing. The header of the packet can indicate the user, the requested service, and (optionally) micropayment instructions. (Again, the header could convey an index or other identifier by which a desired transaction is looked-up in a cloud database, or which serves to arrange an operation, or a sequence of processes for some transaction, such as a purchase, a posting on Facebook, a face- or object-recognition operation, etc. Once such an indexed transaction arrangement is initially configured, it can be easily invoked simply by sending a packet to the cloud containing the image-related data, and an identifier indicating the desired operation.)

At the Apple service, for example, a server may examine the incoming packet, look-up the user's iPhoto account, access facial recognition data for the user's friends from that account, compute facial recognition features from image data conveyed with the packet, determine a best match, and return result information (e.g., a name of a depicted individual) back to the originating device.

At the IP address for the Google service, a server may undertake similar operations, but would refer to the user's Picasa account. Ditto for Facebook.

Identifying a face from among faces for dozens or hundreds of known friends is easier than identifying faces of strangers. Other vendors may offer services of the latter sort. For example, L-1 Identity Solutions, Inc. maintains databases of images from government-issued credentials, such as drivers' licenses. With appropriate permissions, it may offer facial recognition services drawing from such databases.

Other processing operations can similarly be operated remotely. One is a barcode processor, which would take processed image data sent from the mobile phone and apply a decoding algorithm particular to the type of barcode present. A service may support one, a few, or dozens of different types of barcode. The decoded data may be returned to the phone, or the service provider can access further data indexed by the decoded data, such as product information, instructions, purchase options, etc., and return such further data to the phone. (Or both can be provided.)

Another service is digital watermark reading. Another is optical character recognition (OCR). An OCR service provider may further offer translation services, e.g., converting processed image data into ASCII symbols, and then submitting the ASCII words to a translation engine to render them in a different language. Other services are sampled in FIG. 2. (Practicality prevents enumeration of the myriad other services, and component operations, that may also be provided.)

The output from the remote service provider is commonly returned to the cell phone. In many instances the remote service provider will return processed image data. In some cases it may return ASCII or other such data. Sometimes, however, the remote service provider may produce other forms of output, including audio (e.g., MP3) and/or video (e.g., MPEG4 and Adobe Flash).

Video returned to the cell phone from the remote provider may be presented on the cell phone display. In some implementations such video presents a user interface screen, inviting the user to touch or gesture within the displayed presentation to select information or an operation, or issue an instruction. Software in the cell phone can receive such user input and undertake responsive operations, or present responsive information.

In still other arrangements, the data provided back to the cell phone from the remote service provider can include JavaScript or other such instructions. When run by the cell phone, the JavaScript provides a response associated with the processed data referred out to the remote provider.

Remote processing services can be provided under a variety of different financial models. An Apple iPhone service plan may be bundled with a variety of remote services at no additional cost, e.g., iPhoto-based facial recognition. Other services may bill on a per-use, monthly subscription, or other usage plans.

Some services will doubtless be highly branded, and marketed. Others may compete on quality; others on price.

As noted, stored data may indicate preferred providers for different services. These may be explicitly identified (e.g., send all FFT operations to the Fraunhofer Institute service), or they can be specified by other attributes. For example, a cell phone user may direct that all remote service requests are to be routed to providers that are ranked as fastest in a periodically updated survey of providers (e.g., by Consumers Union). The cell phone can periodically check the published results for this information, or it can be checked dynamically when a service is requested. Another user may specify that service requests are to be routed to service providers that have highest customer satisfaction scores, again by reference to an online rating resource. Still another user may specify that requests should be routed to the providers having highest customer satisfaction scores, but only if the service is provided for free; else route to the lowest cost provider. Combinations of these arrangements, and others, are of course possible. The user may, in a particular case, specify a particular service provider, trumping any selection that would be made by the stored profile data.
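The routing preferences described above lend themselves to a small rule-based selector. The following is a sketch under stated assumptions: the `Provider` fields, the policy names, and the `choose_provider` function are hypothetical, and the third policy is read literally (use the top-rated provider only if it is free, otherwise use the cheapest).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    name: str
    speed_rank: int       # 1 = fastest in the published survey
    satisfaction: float   # higher is better
    cost: float           # 0.0 means the service is free


def choose_provider(providers: List[Provider], policy: str,
                    override: Optional[str] = None) -> Provider:
    """Pick a provider from stored profile policy, unless the user overrides."""
    if override is not None:
        # An explicit user choice trumps the stored profile data.
        return next(p for p in providers if p.name == override)
    if policy == "fastest":
        return min(providers, key=lambda p: p.speed_rank)
    if policy == "satisfaction":
        return max(providers, key=lambda p: p.satisfaction)
    if policy == "satisfaction_if_free":
        top = max(providers, key=lambda p: p.satisfaction)
        if top.cost == 0.0:
            return top
        return min(providers, key=lambda p: p.cost)
    raise ValueError(f"unknown policy: {policy}")
```

In practice the rank and satisfaction fields would be refreshed from the online rating resource, either periodically or when a service is requested.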

In still other arrangements the user's request for service can be externally posted, and several service providers may express interest in performing the requested operation. Or the request can be sent to several specific service providers for proposals (e.g., to Amazon, Google and Microsoft). The different providers' responses (pricing, other terms, etc.) may be presented to the user, who selects between them, or a selection may be made automatically, based on previously stored rules. In some cases, one or more competing service providers can be provided user data with which they start performing, or wholly perform, the subject operation before a service provider selection is finally made, giving such providers a chance to speed their response times, and encounter additional real-world data. (See, also, the earlier discussion of remote service providers, including auction-based services, e.g., in connection with FIGS. 7-12.)

As elsewhere indicated, certain external service requests may pass through a common hub (module), which is responsible for distributing the requests to appropriate service providers. Reciprocally, results from certain external service requests may similarly be routed through a common hub. For example, payloads decoded by different service providers from different digital watermarks (or payloads decoded from different barcodes, or fingerprints computed from different content objects) may be referred to a common hub, which may compile statistics and aggregate information (akin to Nielsen's monitoring services, surveying consumer encounters with different data). Besides the decoded watermark data (barcode data, fingerprint data), the hub may also (or alternatively) be provided with a quality or confidence metric associated with each decoding/computing operation. This may help reveal packaging issues, print issues, media corruption issues, etc., that need consideration.

Pipe Manager

In the FIG. 16 implementation, communications to and from the cloud are facilitated by a pipe manager 51. This module (which may be realized as the cell phone-side portion of the query router and response manager of FIG. 7) performs a variety of functions relating to communicating across a data pipe 52. (It will be recognized that pipe 52 is a data construct that may comprise a variety of communication channels.)

One function performed by pipe manager 51 is to negotiate for needed communication resources. The cell phone can employ a variety of communication networks and commercial data carriers, e.g., cellular data, WiFi, Bluetooth, etc., any or all of which may be utilized. Each may have its own protocol stack. In one respect the pipe manager 51 interacts with respective interfaces for these data channels, determining the availability of bandwidth for different data payloads.

For example, the pipe manager may alert the cellular data carrier local interface and network that there will be a payload ready for transmission starting in about 450 milliseconds. It may further specify the size of the payload (e.g., two megabits), its character (e.g., block data), and a needed quality of service (e.g., data throughput rate). It may also specify a priority level for the transmission, so that the interface and network can service such transmission ahead of lower-priority data exchanges, in the event of a conflict.

The pipe manager knows the expected size of the payload due to information provided by the control processor module 36. (In the illustrated embodiment, the control processor module specifies the particular processing that will yield the payload, and so it can estimate the size of the resulting data.) The control processor module can also predict the character of the data, e.g., whether it will be available as a fixed block or intermittently in bursts, the rate at which it will be provided for transmission, etc. The control processor module 36 can also predict the time at which the data will be ready for transmission. The priority information, too, is known by the control processor module. In some instances the control processor module autonomously sets the priority level. In other instances the priority level is dictated by the user, or by the particular application being serviced.

For example, the user may expressly signal (through the cell phone's graphical user interface), or a particular application may regularly require, that an image-based action is to be processed immediately. This may be the case, for example, where further action from the user is expected based on the results of the image processing. In other cases the user may expressly signal, or a particular application may normally permit, that an image-based action can be performed whenever convenient (e.g., when needed resources have low or nil utilization). This may be the case, for example, if a user is posting a snapshot to a social networking site such as Facebook, and would like the image annotated with names of depicted individuals through facial recognition processing. Intermediate prioritization (expressed by the user, or by the application) can also be employed, e.g., process within a minute, ten minutes, an hour, a day, etc.

In the illustrated arrangement, the control processor module 36 informs the pipe manager of the expected data size, character, timing, and priority, so that the pipe manager can use same in negotiating for the desired service. (In other embodiments, less or more information can be provided.)

If the carrier and interface can meet the pipe manager's request, further data exchange may ensue to prepare for the data transmission and ready the remote system for the expected operation. For example, the pipe manager may establish a secure socket connection with a particular computer in the cloud that is to receive that particular data payload, and identify the user. If the cloud computer is to perform a facial recognition operation, it may prepare for the operation by retrieving from Apple/Google/Facebook the facial recognition features, and associated names, for friends of the specified user.

Thus, in addition to preparing a channel for the external communication, the pipe manager enables pre-warming of the remote computer, to ready it for the expected service request. (The service request may not follow.) In some instances the user may operate the shutter button, and the cell phone may not know what operation will follow. Will the user request a facial recognition operation? A barcode decoding operation? Posting of the image to Flickr or Facebook? In some cases the pipe manager, or control processor module, may pre-warm several processes. Or it may predict, based on past experience, what operation will be undertaken, and warm appropriate resources. (E.g., if the user performed facial recognition operations following the last three shutter operations, there's a good chance the user will request facial recognition again.) The cell phone may actually start performing component operations for various of the possible functions before any has been selected, particularly those operations whose results may be useful to several of the functions.
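A prediction of this kind can be as simple as counting recent operations. The sketch below is one illustrative possibility, not a described implementation; the function name `operations_to_prewarm`, the three-operation window, and the cap on how many operations to warm are assumptions.

```python
from collections import Counter
from typing import List


def operations_to_prewarm(history: List[str], window: int = 3,
                          max_ops: int = 2) -> List[str]:
    """Suggest operations to pre-warm, most frequent first.

    `history` lists the operation requested after each past shutter press,
    oldest first; only the last `window` entries are considered.
    """
    recent = history[-window:]
    counts = Counter(recent)
    return [operation for operation, _ in counts.most_common(max_ops)]
```

Returning more than one candidate corresponds to pre-warming several processes when the history is mixed.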

Pre-warming can also include resources within the cell phone: configuring processors, loading caches, etc.

The situation just reviewed contemplates that desired resources are ready to handle the expected traffic. In another situation the pipe manager may report that the carrier is unavailable (e.g., due to the user being in a region of impaired radio service). This information is reported to control processor module 36, which may change the schedule of image processing, buffer results, or take other responsive action.

If other, conflicting, data transfers are underway, the carrier or interface may respond to the pipe manager that the requested transmission cannot be accommodated, e.g., at the requested time or with the requested quality of service. In this case the pipe manager may report same to the control processor module 36. The control processor module may abort the process that was to result in the two megabit data service requirement and reschedule it for later. Alternatively, the control processor module may decide that the two megabit payload may be generated as originally scheduled, and the results may be locally buffered for transmission when the carrier and interface are able to do so. Or other action may be taken.
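The negotiation and its possible outcomes can be summarized in a brief sketch. All names here (`TransmissionRequest`, `Channel`, `negotiate`, `respond`) and the specific throughput figures are hypothetical; the sketch models only the decision flow, not any carrier protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TransmissionRequest:
    ready_in_ms: int          # when the payload is expected to be ready
    size_bits: int            # expected payload size
    character: str            # e.g., "block" or "burst"
    min_throughput_bps: int   # needed quality of service
    priority: int             # lower number = more urgent (an assumption)


@dataclass
class Channel:
    name: str
    free_bps: int             # throughput currently available


def negotiate(request: TransmissionRequest,
              channels: List[Channel]) -> Optional[str]:
    """Return the first channel able to meet the request, else None."""
    for channel in channels:
        if channel.free_bps >= request.min_throughput_bps:
            return channel.name
    return None


def respond(channel_name: Optional[str], can_buffer: bool) -> str:
    """Control processor's reaction to the pipe manager's report."""
    if channel_name is not None:
        return "transmit"
    # No channel: generate on schedule and buffer, or reschedule the process.
    return "buffer" if can_buffer else "reschedule"
```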

Consider a business gathering in which a participant gathers a group for a photo before dinner. The user may want all faces in the photo to be recognized immediately, so that they can be quickly reviewed to avoid the embarrassment of not recalling a colleague's name. Even before the user operates the cell phone's shutter button, the control processor module causes the system to process frames of image data, and identifies apparent faces in the field of view (e.g., oval shapes, with two seeming eyes in expected positions). These may be highlighted by rectangles on the cell phone's viewfinder (screen) display.

While current cameras have picture-taking modes based on lens/exposure profiles (e.g., close-up, nighttime, beach, landscape, snow scenes, etc.), imaging devices employing principles of the present technology may additionally (or alternatively) have different image-processing modes. One mode may be selected by the user to obtain names of people depicted in a photo (e.g., through facial recognition). Another mode may be selected to perform optical character recognition of text found in an image frame. Another may trigger operations relating to purchasing a depicted item. Ditto for selling a depicted item. Ditto for obtaining information about a depicted object, scene or person (e.g., from Wikipedia, a social network, a manufacturer's web site), etc. Ditto for establishing a ThinkPipe session with the item, or a related system. Etc.

These modes may be selected by the user in advance of operating a shutter control, or after. In other arrangements, plural shutter controls (physical or GUI) are provided for the user, respectively invoking different of the available operations. (In still other embodiments, the device infers what operation(s) is/are possibly desired, rather than having the user expressly indicate same.)

If the user at the business gathering takes a group shot depicting twelve individuals, and requests the names on an immediate basis, the pipe manager 51 may report back to the control processor module (or to application software) that the requested service cannot be provided. Due to a bottleneck or other constraint, the manager 51 may report that identification of only three of the depicted faces can be accommodated within service quality parameters considered to constitute an “immediate” basis. Another three faces may be recognized within two seconds, and recognition of the full set of faces may be expected in five seconds. (This may be due to a constraint by the remote service provider, rather than the carrier, per se.)

The control processor module 36 (or application software) may respond to this report in accordance with an algorithm, or by reference to a rule set stored in a local or remote data structure. The algorithm or rule set may conclude that for facial recognition operations, delayed service should be accepted on whatever terms are available, and the user should be alerted (through the device GUI) that there will be a delay of about N seconds before full results are available. Optionally, the reported cause of the expected delay may also be exposed to the user. Other service exceptions may be handled differently, in some cases with the operation aborted or rescheduled or routed to a less-preferred provider, and/or with the user not alerted.
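A stored rule set of the sort mentioned might look like the following sketch. The `RULES` table, its action names, the default rule for unlisted services, and the `handle_service_exception` function are illustrative assumptions rather than a described data structure.

```python
from typing import Optional, Tuple

# Hypothetical rule set keyed by service type.
RULES = {
    "facial_recognition": {"action": "accept_delay", "alert_user": True},
    "barcode": {"action": "reroute", "alert_user": False},
}
DEFAULT_RULE = {"action": "abort", "alert_user": True}  # assumed fallback


def handle_service_exception(service: str,
                             full_results_in_s: float) -> Tuple[str, Optional[str]]:
    """Return (action, optional GUI message) for a reported service shortfall."""
    rule = RULES.get(service, DEFAULT_RULE)
    message = None
    if rule["alert_user"] and rule["action"] == "accept_delay":
        message = f"Full results in about {full_results_in_s:g} seconds"
    return rule["action"], message
```

For the twelve-face example, a report that the full set will take five seconds would yield the `accept_delay` action together with a message for display in the device GUI.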

In addition to considering the ability of the local device interface to the network, and the ability of the network/carrier, to handle the forecast data traffic (within specified parameters), the pipe manager may also query resources out in the cloud, to ensure that they are able to perform whatever services are requested (within specified parameters). These cloud resources can include, e.g., data networks and remote computers. If any responds in the negative, or with a service level qualification, this too can be reported back to the control processor module 36, so that appropriate action can be taken.

In response to any communication from the pipe manager 51 indicating possible trouble servicing the expected data flow, the control processor module 36 may issue corresponding instructions to the pipe manager and/or other modules, as necessary.

In addition to the just-detailed tasks of negotiating in advance for needed services, and setting up appropriate data connections, the pipe manager can also act as a flow control manager: orchestrating the transfer of data from the different modules out of the cell phone, resolving conflicts, and reporting errors back to the control processor module 36.

While the foregoing discussion has focused on outbound data traffic, there is a similar flow inbound, back to the cell phone. The pipe manager (and control processor module) can help administer this traffic as well, providing services complementary to those discussed in connection with outbound traffic.

In some embodiments, there may be a pipe manager counterpart module 53 out in the cloud, cooperating with pipe manager 51 in the cell phone in performance of the detailed functionality.

Software Embodiment of Control Processor & Pipe Manager

Research in the area of autonomous robotics shares some similar challenges with the scenarios described herein, specifically that of enabling a system of sensors to communicate data to local and remote processes, resulting in action to be taken locally. In the case of robotics it involves moving a robot out of harm's way; in the case of the present technology it is most commonly focused on providing a desired experience based on image, sound, etc. encountered.

As opposed to performing simple operations such as obstacle avoidance, aspects of the present technology desire to provide higher levels of semantics, and hence richer experiences, based on sensory input. A user pointing a camera at a poster does not want to know the distance to the wall; the user is much more inclined to want to know about the content on the poster: if it concerns a movie, where it is playing, reviews, what their friends think, etc.

Despite such differences, architectural approaches from robotic toolkits can be adapted for use in the present context. One such robotic toolkit is the Player Project, a set of free software tools for robot and sensor applications, available as open source from sourceforge-dot-net.

An illustration of the Player Project architecture is shown in FIG. 19A. The mobile robot (which typically has a relatively low performance processor) communicates with a fixed server (with a relatively higher performance processor) using a wireless protocol. Various sensor peripherals are coupled to the mobile robot (client) processor through respective drivers, and an API. Likewise, services may be invoked by the server processor from software libraries, through another API. (The CMU CMVision library is shown in FIG. 19A.)

(In addition to the basic tools for interfacing robotic equipment to sensors and service libraries, the Player Project includes “Stage” software that simulates a population of mobile robots moving in a 2D environment, with various sensors and processing, including visual blob detection. “Gazebo” extends the Stage model to 3D.)

By such system architecture, new sensors can quickly be utilized, by provision of driver software that interfaces with the robot API. Similarly, new services can be readily plugged in through the server API. The two Player Project APIs provide standardized abstractions so that the drivers and services do not need to concern themselves with the particular configuration of the robot or server (and vice-versa).

(FIG. 20A, discussed below, also provides a layer of abstraction between the sensors, the locally-available operations, and the externally-available operations.)

Certain embodiments of the present technology can be implemented using a local process & remote process paradigm akin to that of the Player Project, connected by a packet network and inter-process & intra-process communication constructs familiar to artisans (e.g., named pipes, sockets, etc.). Above the communication minutiae is a protocol by which different processes may communicate; this may take the form of a message passing paradigm and message queue, or more of a network centric approach where collisions of keyvectors are addressed after the fact (re-transmission, drop if timely in nature, etc.).

In such embodiments, data from sensors on the mobile device (e.g., microphone, camera) can be packaged in keyvector form, with associated instructions. The instruction(s) associated with data may not be express; they can be implicit (such as Bayer conversion) or session specific, based on context or user desires (in a photo taking mode, face detection may be presumed).

In a particular arrangement, keyvectors from each sensor are created and packaged by device driver software processes that abstract the hardware specific embodiments of the sensor and provide a fully formed keyvector adhering to a selected protocol.

The device driver software can then place the formed keyvector on an output queue unique to that sensor, or in a common message queue shared by all the sensors. Regardless of approach, local processes can consume the keyvectors and perform the needed operations before placing the resultant keyvectors back on the queue. Those keyvectors that are to be processed by remote services are then placed in packets and transmitted directly to a remote process for additional processing, or to a remote service that distributes the keyvectors, similar to a router. It should be clear to the reader that commands to initialize or set up any of the sensors or processes in the system can be distributed in a similar fashion from a Control Process (e.g., box 36 in FIG. 16).
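The common-queue variant can be sketched in a few lines. This is an illustration under assumptions, not a described implementation: the `KeyVector` fields, the single hypothetical local operation `scale`, and the `drain` function are invented for the example, and the sketch is single-threaded for simplicity.

```python
import queue
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class KeyVector:
    sensor: str
    data: Any
    instructions: List[str] = field(default_factory=list)


# Hypothetical operation available on the device itself.
LOCAL_OPS = {"scale": lambda data: [x / 255 for x in data]}


def drain(shared: queue.Queue) -> Tuple[List[KeyVector], List[KeyVector]]:
    """Consume keyvectors from the shared queue until it is empty.

    Returns (finished, outbound): keyvectors with no work left, and
    keyvectors whose next instruction must be handled by a remote service.
    """
    finished, outbound = [], []
    while not shared.empty():
        kv = shared.get()
        if kv.instructions and kv.instructions[0] in LOCAL_OPS:
            operation = kv.instructions.pop(0)
            kv.data = LOCAL_OPS[operation](kv.data)
            shared.put(kv)          # resultant keyvector goes back on the queue
        elif kv.instructions:
            outbound.append(kv)     # to be packetized and sent to a remote process
        else:
            finished.append(kv)
    return finished, outbound
```

A device driver would feed this loop by calling `shared.put(KeyVector(...))` for each sensor reading; the outbound list corresponds to the keyvectors that are placed in packets for remote services.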

Branch Prediction; Commercial Incentives

The technology of branch prediction arose to meet the needs of increasingly complex processor hardware; it allows processors with lengthy pipelines to fetch data and instructions (and in some cases, execute the instructions), without waiting for conditional branches to be resolved.

A similar science can be applied in the present context: predicting what action a human user will take. For example, as discussed above, the just-detailed system may “pre-warm” certain processors, or communication channels, in anticipation that certain data or processing operations will be forthcoming.

When a user removes an iPhone from her purse (exposing the sensor to increased light) and lifts it to eye level (as sensed by accelerometers), what is she about to do? Reference can be made to past behavior to make a prediction. Particularly relevant may be what the user did with the phone camera the last time it was used; what the user did with the phone camera at about the same time yesterday (and at the same time a week ago); what the user last did at about the same location; etc. Corresponding actions can be taken in anticipation.

If her latitude/longitude correspond to a location within a video rental store, that helps. Expect to perhaps perform image recognition on artwork from a DVD box. To speed possible recognition, perhaps SIFT or other feature recognition reference data should be downloaded for candidate DVDs and stored in a cell phone cache. Recent releases are good prospects (except those rated G, or rated high for violence; stored profile data indicates the user just doesn't have a history of watching those). So are movies that she's watched in the past (as indicated by historical rental records, also available to the phone).

If the user's position corresponds to a downtown street, and magnetometer and other position data indicate she is looking north, inclined up from the horizontal, what's likely to be of interest? Even without image data, a quick reference to online resources such as Google Streetview can suggest she's looking at business signage along 5th Avenue. Maybe feature recognition reference data for this geography should be downloaded into the cache for rapid matching against to-be-acquired image data.

To speed performance, the cache should be loaded in a rational fashion, so that the most likely object is considered first. Google Streetview for that location includes metadata indicating 5th Avenue has signs for a Starbucks, a Nordstrom store, and a Thai restaurant. Stored profile data for the user reveals she visits Starbucks daily (she has their branded loyalty card); she is a frequent clothes shopper (albeit with a Macy's, rather than a Nordstrom's charge card); and she's never eaten at a Thai restaurant. Perhaps the cache should be loaded so as to most quickly identify the Starbucks sign, followed by Nordstrom, followed by the Thai restaurant.
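
The ordering in this example can be sketched as a sort on a per-user affinity score. The function name and the numeric scores are hypothetical; how such scores would be derived from profile data is not specified here.

```python
def cache_load_order(candidates, affinity):
    """Order reference data so the most likely objects are matched first.

    `affinity` maps a candidate name to a score in [0, 1] derived from the
    user's profile; candidates with no score default to 0.
    """
    return sorted(candidates, key=lambda name: affinity.get(name, 0.0), reverse=True)


# Illustrative scores for the user described above (assumed values).
profile = {"Starbucks": 0.9, "Nordstrom": 0.5, "Thai restaurant": 0.0}
order = cache_load_order(["Thai restaurant", "Nordstrom", "Starbucks"], profile)
```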

Low resolution imagery captured for presentation on the viewfinder fails to trigger the camera's feature highlighting probable faces (e.g., for exposure optimization purposes). That helps. There's no need to pre-warm the complex processing associated with facial recognition.

She touches the virtual shutter button, capturing a frame of high resolution imagery, and image analysis gets underway, trying to recognize what's in the field of view, so that the camera application can overlay a ranked ordering of graphical links related to objects in the captured frame. (Or this may happen without user action; the camera may be watching proactively.) Unlike Google web search, which ranks search results in an order based on aggregate user data, the camera application attempts a ranking customized to the user's profile. If a Starbucks sign or logo is found in the frame, the Starbucks link gets top position for this user.

If signs for Starbucks, Nordstrom, and the Thai restaurant are all found, links would normally be presented in that order (per the user's preferences inferred from profile data). However, the cell phone application may have a capitalistic bent and be willing to promote a link by a position or two (although perhaps not to the top position) if circumstances warrant. In the present case, the cell phone routinely sent IP packets to the web servers at addresses associated with each of the links, alerting them that an iPhone user had recognized their corporate signage from a particular latitude/longitude. (Other user data may also be provided, if privacy considerations and user permissions allow.) The Thai restaurant server responds back in an instant, offering to the next two customers 25% off any one item (the restaurant's point of sale system indicates only four tables are occupied and no order is pending; the cook is idle). The restaurant server offers three cents if the phone will present the discount offer to the user in its presentation of search results, or five cents if it will also promote the link to second place in the ranked list, or ten cents if it will do that and be the only discount offer presented in the results list. (Starbucks also responded with an incentive, but not as attractive.) The cell phone quickly accepts the restaurant's offer, and payments are quickly made, either to the user (e.g., defraying the monthly phone bill) or more likely to the phone carrier (e.g., AT&T). Links are presented to Starbucks, the Thai restaurant, and Nordstrom, in that order, with the restaurant's link noting the discount for the next two customers.
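
The bounded promotion in this example can be sketched as below. It is a minimal illustration, assuming a hypothetical `promote` helper; the rule that a paid link never reaches the top slot is modeled by the `allow_top` parameter.

```python
def promote(ranked, link, positions, allow_top=False):
    """Return a new ranking with `link` moved up by at most `positions` places.

    Unless `allow_top` is set, a promoted link never displaces the top entry.
    """
    result = list(ranked)
    i = result.index(link)
    floor = 0 if allow_top else 1
    target = min(i, max(floor, i - positions))  # never move a link downward
    result.insert(target, result.pop(i))
    return result


# Profile-based order, then the restaurant's paid one-place promotion.
links = promote(["Starbucks", "Nordstrom", "Thai restaurant"], "Thai restaurant", 1)
```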

Google's AdWords technology has already been noted. It decides, based on factors including a reverse-auction determined payment, which ads to present as Sponsored Links adjacent the results of a Google web search. Google has adapted this technology to present ads on third party web sites and blogs, based on the particular contents of those sites, terming the service AdSense.

In accordance with another aspect of the present technology, the AdWords/AdSense technology is extended to visual image search on cell phones.

Consider a user located in a small bookstore who snaps a picture of the Warren Buffett biography Snowball. The book is quickly recognized, but rather than presenting a corresponding Amazon link atop the list (as may occur with a regular Google search), the cell phone recognizes that the user is located in an independent bookstore. Context-based rules consequently dictate that it present a non-commercial link first. Top ranked of this type is a Wall Street Journal review of the book, which goes to the top of the presented list of links. Decorum, however, only goes so far. The cell phone passes the book title or ISBN (or the image itself) to Google AdSense or AdWords, which identifies sponsored links to be associated with that object. (Google may independently perform its own image analysis on any provided imagery. In some cases it may pay for such cell phone-submitted imagery, since Google has a knack for exploiting data from diverse sources.) Per Google, Barnes and Noble has the top sponsored position, followed by alldiscountbooks-dot-net. The cell phone application may present these sponsored links in a graphically distinct manner to indicate their origin (e.g., in a different part of the display, or presented in a different color), or it may insert them alternately with non-commercial search results, i.e., at positions two and four. The AdSense revenue collected by Google can again be shared with the user, or with the user's carrier.

In some embodiments, the cell phone (or Google) again pings the servers of companies for whom links will be presented, helping them track their physical world-based online visibility. The pings can include the location of the user, and an identification of the object that prompted the ping. When alldiscountbooks-dot-net receives the ping, it may check inventory and find it has a significant overstock of Snowball. As in the example earlier given, it may offer an extra payment for some extra promotion (e.g., including “We have 732 copies, cheap!” in the presented link).

In addition to offering an incentive for a more prominent search listing (e.g., higher in the list, or augmented with additional information), a company may also offer additional bandwidth to serve information to a customer. For example, a user may capture video imagery from an electronic billboard, and want to download a copy to show to friends. The user's cell phone identifies the content as a popular clip of user generated content (e.g., by reference to an encoded watermark), and finds that the clip is available from several sites, the most popular of which is YouTube, followed by MySpace. To induce the user to link to MySpace, MySpace may offer to upgrade the user's baseline wireless service from 3 megabits per second to 10 megabits per second, so the video will download in roughly a third of the time. This upgraded service can be only for the video download, or it can be longer. The link presented on the screen of the user's cell phone can be amended to highlight the availability of the faster service. (Again, MySpace may make an associated payment.)

Sometimes alleviating a bandwidth bottleneck requires opening a bandwidth throttle on the cell phone end of the wireless link. Or the bandwidth service change must be requested, or authorized, by the cell phone. In such case MySpace can tell the cell phone application to take needed steps for higher bandwidth service, and MySpace will rebate to the user (or to the carrier, for benefit of the user's account) the extra associated costs.

In some arrangements, the quality of service (e.g., bandwidth) is managed by pipe manager 51. Instructions from MySpace may request that the pipe manager start requesting augmented service quality, and setting up the expected high bandwidth session, even before the user selects the MySpace link.

In some scenarios, a vendor may negotiate preferential bandwidth for its content. MySpace may make a deal with AT&T, for example, that all MySpace content delivered to AT&T phone subscribers be delivered at 10 megabits per second, even though most subscribers normally only receive 3 megabits per second service. The higher quality service may be highlighted to the user in the presented links.

Modeling of User Behavior

Aided by knowledge of a particular physical environment, a specific place and time, and behavior profiles of expected users, simulation models of human computer interaction with the physical world can be based on tools and techniques from fields as diverse as robotics and audience measurement. An example of this might be the number of expected mobile devices in a museum at a particular time; the particular sensors that such devices are likely to be using; and what stimuli are expected to be captured by those sensors (e.g., where are they pointing the camera, what is the microphone hearing, etc.). Additional information can include assumptions about social relationships between users: Are they likely to share common interests? Are they within common social circles that are likely to share content, to share experiences, or desire creating location-based experiences such as wiki-maps (cf. Barricelli, “Map-Based Wikis as Contextual and Cultural Mediators,” MobileHCI, 2009)?

In addition, modeling can range from generalized heuristics derived from observations at past events (e.g., how many people used their cell phone cameras to capture imagery from the Portland Trail Blazers' scoreboard during a basketball game, etc.), to more evolved predictive models that are based on innate human behavior (e.g., people are more likely to capture imagery from a scoreboard during overtime than during a game's half-time).

Such models can inform many aspects of the experience for the users, in addition to the business entities involved in provisioning and measuring the experience.

These latter entities may consist of the traditional value chain participants involved in event production, and the arrangements involved in measuring interaction and monetizing it. They include event planners, producers, and artists on the creation side, and the associated rights societies (ASCAP, Directors Guild of America, etc.) on the royalties side. From a measurement perspective, both sampling-based techniques from opt-in users and devices, and census-driven techniques can be utilized. Metrics may range from Revenue Per Unit (RPU) generated by digital traffic on the digital service provider network (how much bandwidth is being consumed), for more static environments, to more evolved models of Click Through Rates (CTR) for particular sensor stimuli.

For example, the Mona Lisa painting in the Louvre is likely to have a much higher CTR than other paintings in the museum, informing matters such as priority for content provisioning, e.g., content related to the Mona Lisa should be cached and be as close to the edge of the cloud as possible, if not pre-loaded onto the mobile device itself when the user approaches or enters the museum. (Of equal importance is the role that CTR plays in monetizing the experience and environment.)

Consider a school group that enters a sculpture museum having a garden with a collection of Rodin works. The museum may provide content related to Rodin and his works on servers or infrastructure (e.g., router caches) that serve the garden. Moreover, because the visitors comprise a pre-established social group, the museum may expect some social connectivity. So the museum may enable sharing capabilities (e.g., ad hoc networking) that might not otherwise be used. If one student queries the museum's online content to learn more about a particular Rodin sculpture, the system may accompany delivery of the solicited information with a prompt inviting the student to share this information with others in the group. The museum server can suggest particular “friends” of the student with whom such information might be shared, if such information is publicly accessible from Facebook or another social networking data source. In addition to names of friends, such a social networking data source can also provide device identifiers, IP addresses, profile information, etc., for the student's friends, which may be leveraged to assist the dissemination of educational material to others in the group. These other students may find this particular information relevant, since it was of interest to another in their group, even if the original student's name is not identified. If the original student is identified with the conveyed information, then this may heighten the information's interest to others in the group.

(Detection of a socially-linked group may be inferred from review of the museum's network traffic. For example, if a device sends packets of data to another, and the museum's network handles both ends of the communication (dispatch and delivery), then there's an association between two devices in the museum. If the devices are not ones that have historical patterns of network usage, e.g., employees, then the system can conclude that two visitors to the museum are socially connected. If a web of such communications is detected, involving several unfamiliar devices, then a social group of visitors can be discerned. The size of the group can be gauged by the number of different participants in such network traffic. Demographic information about the group can be inferred from external addresses with which data is exchanged; middle schoolers may have a high incidence of MySpace traffic; college students may communicate with external addresses at a university domain; senior citizens may demonstrate a different traffic profile. All such information arising from traffic analysis can be employed in automatically adapting the information and services provided to the visitors, as well as providing useful information to the museum's administration and marketing departments.)
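
The group inference described above amounts to finding connected clusters among unfamiliar devices that talk to each other inside the venue. The sketch below is one possible reading, assuming flows are already reduced to (source, destination) device pairs handled end-to-end by the local network; `visitor_groups` and `known_devices` are hypothetical names.

```python
from collections import defaultdict


def visitor_groups(flows, known_devices):
    """Cluster unfamiliar devices that exchange traffic within the venue.

    flows: iterable of (src, dst) pairs where the venue network handled both ends.
    known_devices: devices with historical usage patterns (e.g., employees).
    Returns a list of sets; each set is one inferred social group.
    """
    graph = defaultdict(set)
    for src, dst in flows:
        if src in known_devices or dst in known_devices:
            continue                      # ignore traffic involving familiar devices
        graph[src].add(dst)
        graph[dst].add(src)

    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                      # depth-first walk of one cluster
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

The size of each returned set gives the group-size estimate mentioned in the text.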

Consider other situations. One is halftime at the U.S. football Super Bowl, featuring a headline performer (e.g., Bruce Springsteen, or Prince). The show may cause hundreds of fans to capture pictures or audio-video of the event. Another context with predictable public behavior is the end of an NBA championship basketball game. Fans may want to memorialize the final buzzer excitement: the scoreboard, streamers and confetti dropping from the ceiling, etc. In such cases, actions that can be taken to prepare, or optimize, delivery of content or experience should be taken. Examples include rights clearance for associated content, rendering virtual worlds and other synthesized content, throttling down routine time-insensitive network traffic, queuing commercial resources that may be invoked as people purchase souvenir books/music from Amazon (caching pages, authenticating users to financial sites), propagating links for post-game interviews (some prebuilt/edited and ready to go), caching the Twitter feeds of the star players, buffering video from city center showing the hometown crowds watching on a Jumbotron display (erupting with joy at the buzzer), etc.; anything relating to the experience or follow-on actions should be prepped/cached in advance, where possible.

Stimuli for sensors (audio, visual, tactile, odor, etc.) that are most likely to instigate user action and attention are much more valuable from a commercial standpoint than stimuli less likely to instigate such action (similar to the economic principles on which Google's AdWords ad-serving system is based). Such factors and metrics directly influence advertising models through auction models well understood by those in the art.

Multiple delivery mechanisms exist for advertising delivery by third parties, leveraging known protocols such as VAST. VAST (Digital Video Ad Serving Template) is a standard issued by the Interactive Advertising Bureau that establishes reference communication protocols between scriptable video rendering systems and ad servers, as well as associated XML schemas. As an example, VAST helps standardize the service of video ads to independent web sites (replacing old-style banner ads), commonly based on a bit of JavaScript included in the web page code, which also aids in tracking traffic and managing cookies. VAST can also insert promotional messages in the pre-roll and post-roll viewing of other video content delivered by the web site. The web site owner doesn't concern itself with selling or running the advertisements, yet at the end of the month the web site owner receives payment based on audience viewership/impressions. In similar fashion, physical stimuli presented to users in the real world, sensed by mobile technology, can be the basis for payments to the parties involved.

Dynamic environments in which stimuli presented to users and their mobile devices can be controlled (such as video displays, as contrasted with static posters) provide new opportunities for measurement and utilization of metrics such as CTR.

Background music, content on digital displays, illumination, etc., can be modified to maximize CTR and shape traffic. For example, illumination on particular signage can be increased, or flash, as a targeted individual passes by. Similarly, when a flight from Japan lands at the airport, digital signage, music, etc. can all be modified overtly (change in the advertising to the interests of the expected audience) or covertly (changing the linked experience to take the user to the Japanese language website), to maximize the CTR.

Mechanisms may be introduced as well to contend with rogue or un-approved sensor stimuli. Within the confined spaces of a university or business park, stimuli (posters, music, digital signage, etc.) that don't adhere to the intentions or policies of the property owner, or the entity responsible for a domain, may need to be managed. This can be accomplished through the use of simple blocking mechanisms that are geography-specific (not dissimilar to region coding on DVDs), indicating that all attempts within specific GPS coordinates to route a keyvector to a specific place in the cloud must be mediated by a routing service or gateway managed by the domain owner.
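
A geography-specific mediation rule of this kind might be sketched as a bounding-box lookup that substitutes the domain owner's gateway as the next hop. The fence format, the `route` function, and the example host names are assumptions for illustration only.

```python
def route(destination, lat, lon, fences):
    """Return the next hop for a keyvector sent from (lat, lon).

    Each fence is a dict with 'south', 'north', 'west', 'east' bounds and
    the 'gateway' that must mediate traffic originating inside those bounds.
    Outside every fence, the keyvector goes straight to its destination.
    """
    for fence in fences:
        inside = (fence["south"] <= lat <= fence["north"]
                  and fence["west"] <= lon <= fence["east"])
        if inside:
            return fence["gateway"]
    return destination


# Hypothetical fence around a venue, mediated by the venue's own gateway.
venue = {"south": 39.74, "north": 39.75, "west": -105.01, "east": -105.00,
         "gateway": "gateway.venue.example"}
```

The gateway, once in the path, can then forward, block, or filter the request according to the domain owner's policy.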

Other options include filtering the resultant experience. Is it age appropriate? Does it run counter to pre-existing advertising or branding arrangements, such as a Coca Cola advertisement being delivered to a user inside the Pepsi Center during a Denver Nuggets game?

This may be accomplished on the device as well, through the use of content rules, such as the Movielabs Content Recognition Rules related to conflicting media content (cf. www.movielabs-dot-com/CRR), parental controls provided by carriers to the device, or by adhering to DMCA Automatic Take Down Notices.

Under various rights management paradigms, licenses play a key role in determining how content can be consumed, shared, modified, etc. A result of extracting semantic meaning from stimulus presented to the user (and the user's mobile device), and/or the location in which stimulus is presented, can be issuance of a license to desired content or experiences (games, etc.) by third parties. To illustrate, consider a user at a rock concert in an arena. The user may be granted a temporary license to peruse and listen to all music tracks by the performing artist (and/or others) on iTunes, beyond the 30 second preview rights normally granted to the public. However, such license may only persist during the concert, or only from when the doors open until the headline act begins its performance, or only while the user is in the arena, etc. Thereafter, such license ends.

Similarly, passengers disembarking from an international flight may be granted location-based or time-limited licenses to translation services or navigation services (e.g., an augmented reality system overlaying directions for baggage claim, bathrooms, etc., on camera-captured scenes) for their mobile devices, while they transit through customs, are in the airport, for 90 minutes after their arrival, etc.
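
Time-limited and location-limited licenses like those in the concert and airport examples can be sketched as a small validity check. The `TemporaryLicense` class and its fields are hypothetical; either constraint can be left unset to model a license that is only time-bound or only place-bound.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TemporaryLicense:
    starts: Optional[datetime] = None   # None means no start limit
    ends: Optional[datetime] = None     # None means no expiry
    venue: Optional[str] = None         # None means valid anywhere

    def permits(self, now, current_venue):
        """True if the license is valid at this time and place."""
        if self.starts is not None and now < self.starts:
            return False
        if self.ends is not None and now >= self.ends:
            return False
        if self.venue is not None and current_venue != self.venue:
            return False
        return True


# A 90-minute license granted on arrival (illustrative times).
arrival = datetime(2010, 8, 13, 14, 0)
airport_license = TemporaryLicense(starts=arrival, ends=arrival + timedelta(minutes=90))
```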

Such arrangements can serve as metaphors for experience, and as filtering mechanisms. One embodiment in which sharing of experiences is triggered by sensor stimuli is through broadcast social networks (e.g., Twitter) and syndication protocols (e.g., RSS web feeds/channels). Other users, entities or devices can subscribe to such broadcasts/feeds as the basis for subsequent communication (social, information retrieval, etc.), as logging of activities (e.g., a person's daily journal), or measurement (audience, etc.). Traffic associated with such networks/feeds can also be measured by devices at a particular location, allowing users to traverse in time to understand who was communicating what at a particular point in time. This enables searching for and mining additional information, e.g., was my friend here last week? Was someone from my peer group here? What content was consumed? Such traffic also enables real-time monitoring of how users share experiences. Monitoring “tweets” about a performer's song selection during a concert may cause the performer to alter the songs to be played for the remainder of a concert. The same is true for brand management. For example, if users share their opinions about a car during a car show, live keyword filtering on the traffic can allow the brand owner to re-position certain products for maximum effect (e.g., the new model of Corvette should spend more time on the spinning platform, etc.).

More on Optimization

Predicting the user's action or intent is one form of optimization. Another form involves configuring the processing so as to improve performance.

To illustrate one particular arrangement, consider again the Common Services Sorter of FIG. 6. What keyvector operations should be performed locally, or remotely, or as a hybrid of some sort? In what order should keyvector operations be performed? Etc. The mix of expected operations, and their scheduling, should be arranged in an appropriate fashion for the processing architecture being used, the circumstances, and the context.

One step in the process is to determine which operations need to occur. This determination can be based on express requests from the user, historical patterns of usage, context and status, etc.

Many operations are high level functions, which involve a number of component operations performed in a particular order. For example, optical character recognition may require edge detection, followed by region-of-interest segmentation, followed by template pattern matching. Facial recognition may involve skintone detection, Hough transforms (to identify oval-shaped areas), identification of feature locations (pupils, corners of mouth, nose), eigenface calculation, and template matching.

The system can identify the component operations that may need to be performed, and the order in which their respective results are required. Rules and heuristics can be applied to help determine whether these operations should be performed locally or remotely.

For example, at one extreme, the rules may specify that simple operations, such as color histograms and thresholding, should generally be performed locally. At the other extreme, complex operations may usually default to outside providers.

Scheduling can be determined based on which operations are preconditions to other operations. This can also influence whether an operation is performed locally or remotely (local performance may provide quicker results, allowing subsequent operations to be started with less delay). The rules may seek to identify the operation whose output(s) is used by the greatest number of subsequent operations, and perform this operation first (its respective precedent(s) permitting). Operations that are preconditions to successively fewer other operations are performed successively later. The operations, and their sequence, may be conceived as a tree structure, with the most globally important performed first, and operations of lesser relevance to other operations performed later.
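
The ordering rule above can be sketched as a greedy schedule: among operations whose prerequisites are complete, run the one whose output feeds the most other operations. This is one simple reading, counting only direct consumers and breaking ties alphabetically; the `schedule` function and the operation names in the example are illustrative assumptions.

```python
def schedule(requires):
    """Order operations so widely needed ones run first, prerequisites permitting.

    `requires` maps an operation to the set of operations whose output it needs.
    """
    ops = set(requires)
    for reqs in requires.values():
        ops |= set(reqs)

    # Number of operations that directly consume each operation's output.
    consumers = {op: 0 for op in ops}
    for reqs in requires.values():
        for r in reqs:
            consumers[r] += 1

    done, order = set(), []
    while len(order) < len(ops):
        ready = [op for op in ops
                 if op not in done and set(requires.get(op, ())) <= done]
        if not ready:
            raise ValueError("cyclic dependency among operations")
        nxt = min(ready, key=lambda op: (-consumers[op], op))
        order.append(nxt)
        done.add(nxt)
    return order


plan = schedule({
    "segmentation": {"edge_detection"},
    "template_matching": {"segmentation"},
    "hough_transform": {"edge_detection"},
    "histogram": set(),
})
```

Here edge detection runs first because two other operations depend on it directly.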

Such determinations, however, may also be tempered (or dominated) by other factors. One is power. If the cell phone battery is low, or if an operation will involve a significant drain on a low capacity battery, this can tip the balance in favor of having the operation performed remotely.

Another factor is response time. In some instances, the limited processing capability of the cell phone may mean that processing locally is slower than processing remotely (e.g., where a more robust, parallel, architecture might be available to perform the operation). In other instances, the delays of establishing communication with a remote server, and establishing a session, may make local performance of an operation quicker. Depending on user demand, and needs of other operation(s), the speed with which results are returned may be important, or not.

Still another factor is user preferences. As noted elsewhere, the user may set parameters influencing where, and when, operations are performed. For example, a user may specify that an operation may be referred to remote processing by a domestic service provider, but if none is available, then the operation should be performed locally.

Routing constraints are another factor. Sometimes the cell phone will be in a WiFi or other service area (e.g., in a concert arena) in which the local network provider places limits or conditions on remote service requests that may be accessed through that network. In a concert where photography is forbidden, for example, the local network may be configured to block access to external image processing service providers for the duration of the concert. In this case, services normally routed for external execution should be performed locally.

Yet another factor is the particular hardware with which the cell phone is equipped. If a dedicated FFT processor is available in the phone, then performing intensive FFT operations locally makes sense. If only a feeble general purpose CPU is available, then an intensive FFT operation is probably best referred out for external execution.

A related factor is current hardware utilization. Even if a cell phone is equipped with hardware that is well configured for a certain task, it may be so busy and backlogged that the system may refer a next task of this sort to an external resource for completion.

Another factor may be the length of the local processing chain, and the risk of a stall. Pipelined processing architectures may become stalled for intervals as they wait for data needed to complete an operation. Such a stall can cause all other subsequent operations to be similarly delayed. The risk of a possible stall can be assessed (e.g., by historical patterns, or knowledge that completion of an operation requires further data whose timely availability is not assured, such as a result from another external process) and, if the risk is great enough, the operation may be referred for external processing to avoid stalling the local processing chain.

Yet another factor is connectivity status. Is a reliable, high speed network connection established? Or are packets dropped, or network speed slow (or wholly unavailable)?

Geographical considerations of different sorts can also be factors. One is network proximity to the service provider. Another is whether the cell phone has unlimited access to the network (as in a home region), or a pay-per-use arrangement (as when roaming in another country).

Information about the remote service provider(s) can also be factored. Is the service provider offering immediate turnaround, or are requested operations placed in a long queue, behind other users awaiting service? Once the provider is ready to process the task, what speed of execution is expected? Costs may also be key factors, together with other attributes of importance to the user (e.g., whether the service provider meets “green” standards of environmental responsibility). A great many other factors can also be considered, as may be appropriate in particular contexts. Sources for such data can include the various elements shown in the illustrative block diagrams, as well as external resources.

A conceptual illustration of the foregoing is provided in FIG. 19B.

Based on the various factors, a determination can be made as to whether an operation should be performed locally, or remotely. (The same factors may be assessed in determining the order in which operations should be performed.)

In some embodiments, the different factors can be quantified by scores, which can be combined in polynomial fashion to yield an overall score, indicating how an operation should be handled. Such an overall score serves as a metric indicating the relative suitability of the operation for remote or external processing. (A similar scoring approach can be employed to choose between different service providers.)
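
A polynomial combination of factor scores might look like the sketch below: each factor score is raised to an exponent, weighted, and summed, and the total is compared with a threshold. The factor names, weights, exponents, and threshold are all assumed values chosen for illustration, not figures from the disclosure.

```python
def remote_suitability(scores, weights, exponents=None):
    """Combine factor scores as sum(weight * score ** exponent).

    Higher totals indicate an operation better suited to remote processing.
    Factors without an explicit exponent are treated as linear terms.
    """
    exponents = exponents or {}
    return sum(weights[name] * scores[name] ** exponents.get(name, 1)
               for name in weights)


def choose_location(scores, weights, threshold=0.5, exponents=None):
    """Return 'remote' when the overall score reaches the threshold, else 'local'."""
    total = remote_suitability(scores, weights, exponents)
    return "remote" if total >= threshold else "local"


# Illustrative factors: a nearly drained battery and a middling connection.
scores = {"battery_low": 1.0, "connectivity": 0.5}
weights = {"battery_low": 0.5, "connectivity": 0.5}
overall = remote_suitability(scores, weights, exponents={"connectivity": 2})
```

The same function could score competing service providers by supplying provider-specific factor scores.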

Depending on changing circumstances, a given operation may be performed locally at one instant, and performed remotely at a later instant (or vice versa). Or, the same operation may be performed on two sets of keyvector data at the same time: one locally, and one remotely.

While described in the context of determining whether an operation should be performed locally or remotely, the same factors can influence other matters as well. For example, they can also be used in deciding what information is conveyed by keyvectors.

Consider a circumstance in which the cell phone is to perform OCR on captured imagery. With one set of factors, unprocessed pixel data from a captured image may be sent to a remote service provider to perform the OCR. Under a different set of factors, the cell phone may perform initial processing, such as edge detection, and then package the edge-detected data in keyvector form, and route it to an external provider to complete the OCR operation. Under still another set of factors, the cell phone may perform all of the component OCR operations up until the last (template matching), and send out data only for this last operation. (Under yet another set of factors, the OCR operation may be completed wholly by the cell phone, or different components of the operation can be performed alternately by the cell phone and remote service provider(s), etc.)
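
The alternatives in this OCR example differ only in where the pipeline is cut. A minimal sketch, assuming a fixed three-stage pipeline and a hypothetical `plan_ocr` helper, shows how the cut point determines what data the keyvector carries; the alternating local/remote case is not modeled.

```python
OCR_STAGES = ["edge_detection", "segmentation", "template_matching"]


def plan_ocr(local_count, stages=OCR_STAGES):
    """Split the OCR pipeline after `local_count` locally performed stages.

    Returns the local stages, the remote stages, and a label for the data
    the outgoing keyvector would carry (None if nothing is sent out).
    """
    local_count = max(0, min(local_count, len(stages)))
    local, remote = stages[:local_count], stages[local_count:]
    if not remote:
        data = None                      # OCR completed wholly on the phone
    elif not local:
        data = "raw_pixels"              # unprocessed image sent out
    else:
        data = local[-1] + "_output"     # intermediate result sent out
    return {"local": local, "remote": remote, "keyvector_data": data}
```

For instance, `plan_ocr(1)` models the case in which edge-detected data is packaged and routed to an external provider.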

Reference was made to routing constraints as one possible factor. This is a particular example of a more general factor: external business rules. Consider the earlier example of a user who is attending an event at the Pepsi Center in Denver. The Pepsi Center may provide wireless communication services to patrons, through its own WiFi or other network. Naturally, the Pepsi Center is reluctant for its network resources to be used for the benefit of competitors, such as Coca Cola. The host network may thus influence cloud services that can be utilized by its patrons (e.g., by making some inaccessible, or by giving lower priority to data traffic of certain types, or with certain destinations). The domain owner may exert control over what operations a mobile device is capable of performing. This control can influence the local/remote decision, as well as the type of data conveyed in keyvector packets.

Another example is a gym, which may want to impede usage of cell phone cameras, e.g., by interfering with access to remote service providers for imagery, as well as photo sharing sites such as Flickr and Picasa. Still another example is a school which, for privacy reasons, may want to discourage facial recognition of its students and staff. In such case, access to facial recognition service providers can be blocked, or granted only on a moderated case-by-case basis. Venues may find it difficult to stop individuals from using cell phone cameras, or using them for particular purposes, but they can take various actions to impede such use (e.g., by denying services that would promote or facilitate such use).

The following outline identifies other factors that may be relevant in determining which operations are performed where, and in what sequence:

1. Scheduling optimization of keyvector processing units based on numerous factors:

    -   Operation mix: which operations consist of similar atomic instructions (MicroOps, Pentium II, etc.)
    -   Stall states: which operations will generate stalls due to:
        -   waiting for external keyvector processing
        -   poor connectivity
        -   user input
        -   change in user focus
    -   Cost of operation based on:
        -   published cost
        -   expected cost based on state of auction
        -   state of battery and power mode
        -   power profile of the operation (is it expensive?)
        -   past history of power consumption
        -   opportunity cost, given the current state of the device, e.g., what other processes should take priority such as a voice call, GPS navigation, etc.
        -   user preferences, e.g., I want a “green” provider, or open source provider
        -   legal uncertainties (e.g., certain providers may be at greater risk of patent infringement charges, e.g., due to their use of an allegedly patented method)
    -   Domain owner influence:
        -   privacy concerns in specific physical arenas such as no face recognition at schools
        -   pre-determined content-based rules prohibiting specific operations against specific stimuli
            -   voiceprint matching against broadcast songs highlighting the use of other singers (Milli Vanilli's Grammy award was revoked when officials discovered that the actual vocals on the subject recording were performed by other singers)
    -   All of the above influence scheduling and the ability to perform out-of-order execution of keyvectors based on the optimal path to the desired outcome
        -   uncertainty in a long chain of operations, making prediction of need for subsequent keyvector operations difficult (akin to the deep pipeline in processors and branch prediction); difficulties might be due to weak metrics on keyvectors
            -   past behavior
            -   location (GPS indicates that the device is in quick motion) and pattern of GPS movements
                -   is there a pattern of exposure to stimuli, such as a user walking through an airport terminal being repeatedly exposed to CNN that is being presented at each gate
            -   proximity sensors indicating the device was placed in a pocket, etc.
            -   other approaches such as Least Recently Used (LRU) can be used to track how infrequently the desired keyvector operation resulted in or contributed to the desired effect (recognition of a song, etc.)

Further regarding pipelined or other time-consuming operations, a particular embodiment may undertake some suitability testing before engaging a processing resource for what may be more than a threshold number of clock cycles. A simple suitability test is to make sure the image data is potentially useful for the intended purpose, as contrasted with data that can be quickly disqualified from analysis. For example, the test can check whether the frame is all black (e.g., a frame captured in the user's pocket). Adequate focus can also be checked quickly before committing to an extended operation.
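
A minimal sketch of such a quick disqualification test follows. It assumes a frame is a list of rows of 8-bit luminance values; the function names, the darkness threshold, and the sampling step are illustrative assumptions, not values from this specification. Only a sparse grid of pixels is sampled, so the gate costs far fewer cycles than the operation it guards.

```python
def is_all_black(frame, threshold=8, step=4):
    """Return True if no sampled pixel is brighter than `threshold`.

    `threshold` and `step` are hypothetical tuning values.
    """
    for row in frame[::step]:
        for pixel in row[::step]:
            if pixel > threshold:
                return False
    return True


def worth_processing(frame):
    """Cheap gate to run before committing to a long operation."""
    return bool(frame) and not is_all_black(frame)


pocket_frame = [[0] * 16 for _ in range(16)]    # e.g., captured in a pocket
scene_frame = [[100] * 16 for _ in range(16)]   # uniformly lit scene
print(worth_processing(pocket_frame), worth_processing(scene_frame))
```

Because the check samples only every fourth pixel in each direction, a small bright detail could be missed; a real implementation would trade sampling density against the cost of a wasted long operation.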

(The artisan will recognize that certain of the aspects of this technology discussed above have antecedents visible in hindsight. For example, considerable work has been put into instruction optimization for pipelined processors. Also, some devices have allowed user configuration of power settings, e.g., user-selectable deactivation of a power-hungry GPU in certain Apple notebooks to extend battery life.)

The above-discussed determination of an appropriate instruction mix (e.g., by the Common Services Sorter of FIG. 6) particularly considered certain issues arising in pipelined architectures. Different principles can apply in embodiments in which one or more GPUs are available. These devices typically have hundreds or thousands of scalar processors that are adapted for parallel execution, so that costs of execution (time, stall risk, etc.) are small. Branch prediction can be handled by not predicting: instead, the GPU processes all of the potential outcomes of a branch in parallel, and the system uses whatever output corresponds to the actual branch condition when it becomes known.
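
The "compute every outcome, then select" idea can be sketched as follows. This is an illustration only: a thread pool stands in for the GPU's scalar processors, and the function name `speculative_branch` and its arguments are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_branch(condition_fn, outcomes, data):
    """Start every branch outcome at once, then keep the one that applies.

    outcomes: dict mapping each possible condition value to a function
    that takes `data`. All of them are submitted before the condition
    is evaluated, so no prediction is needed.
    """
    with ThreadPoolExecutor(max_workers=len(outcomes)) as pool:
        futures = {key: pool.submit(fn, data) for key, fn in outcomes.items()}
        actual = condition_fn(data)      # branch condition becomes known
        return futures[actual].result()  # other results are simply discarded


result = speculative_branch(
    condition_fn=lambda d: len(d) > 2,
    outcomes={True: sum, False: max},
    data=[1, 2, 3],
)
print(result)
```

The wasted work on the unused outcome is the price paid for never stalling on a misprediction, which is acceptable only when parallel execution units are plentiful.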

To illustrate, consider facial recognition. A GPU-equipped cell phone may invoke instructions, when its camera is activated in a user photo-shoot mode, that configure 20 clusters of scalar processors in the GPU. (Such a cluster is sometimes termed a “stream processor.”) In particular, each cluster is configured to perform a Hough transform on a small tile from a captured image frame, looking for one or more oval shapes that may be candidate faces. The GPU thus processes the entire frame in parallel, by 20 concurrent Hough transforms. (Many of the stream processors will probably find nothing, but the process speed is not impaired.)

When these GPU Hough transform operations complete, the GPU may be reconfigured into a lesser number of stream processors, one dedicated to analyzing each candidate oval shape, to determine positions of eye pupils, nose location, and distance across the mouth. For any oval that yielded useful candidate facial information, associated parameters would be packaged in keyvector form, and transmitted to a cloud service that checks the keyvectors of analyzed facial parameters against known templates, e.g., of the user's Facebook friends. (Or, such checking could also be performed by the GPU, or by another processor in the cell phone.)

(It is interesting to note that this facial recognition, like others detailed in this specification, distills the volume of data, e.g., from millions of pixels (bytes) in the originally captured image, to a keyvector that may comprise a few tens, hundreds, or thousands of bytes. This smaller parcel of information, with its denser information content, is more quickly routed for processing, sometimes externally. Communication of the distilled keyvector information takes place over a channel with a corresponding bandwidth capability, keeping costs reasonable and implementation practical.)

Contrast the just-described GPU implementation of face detection to such an operation as it might be implemented on a scalar processor. Performing Hough-transform-based oval detection across the entire image frame is prohibitive in terms of processing time: much of the effort would be for naught, and would delay other tasks assigned to the processor. Instead, such an implementation would typically have the processor examine pixels as they come from the camera, looking for those having color within an expected “skintone” range. Only if a region of skintone pixels is identified would a Hough transform then be attempted on that excerpt of the image data. In similar fashion, attempting to extract facial parameters from detected ovals would be done in a laborious serial fashion, often yielding no useful result.
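
The scalar-processor strategy of gating the expensive transform behind a cheap color test might look like the sketch below. The RGB rule in `is_skintone` is a rough, hypothetical rule of thumb rather than a range given in this specification, and `hough_oval_detector` is a placeholder for whatever oval detector the implementation supplies.

```python
def is_skintone(rgb):
    """Hypothetical skintone rule; real ranges would be tuned empirically."""
    r, g, b = rgb
    return r > 95 and g > 40 and b > 20 and r > g and r > b and r - min(g, b) > 15


def skintone_bounding_box(frame, min_pixels=4):
    """Return (top, left, bottom, right) around skintone pixels, or None."""
    hits = [(y, x) for y, row in enumerate(frame)
            for x, rgb in enumerate(row) if is_skintone(rgb)]
    if min_pixels > len(hits):
        return None
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return min(ys), min(xs), max(ys), max(xs)


def gated_oval_search(frame, hough_oval_detector, min_pixels=4):
    """Run the costly detector only on an excerpt containing skintone pixels."""
    box = skintone_bounding_box(frame, min_pixels)
    if box is None:
        return []  # nothing face-like: skip the Hough transform entirely
    top, left, bottom, right = box
    excerpt = [row[left:right + 1] for row in frame[top:bottom + 1]]
    return hough_oval_detector(excerpt)
```

A single bounding box is the simplest possible notion of a "region"; an implementation that expects several faces would more likely group skintone pixels into connected regions and test each one.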

Ambient Light

Many artificial light sources do not provide consistent illumination. Most exhibit a temporal variation in intensity (luminance) and/or color. These variations commonly track the AC power frequency (50/60 or 100/120 Hz), but sometimes do not. For example, fluorescent tubes can give off infrared illumination that varies at a ~40 kHz rate. The emitted spectra depend on the particular lighting technology. Organic LEDs for domestic and industrial lighting sometimes use distinct color mixtures (e.g., blue and amber) to make white. Others employ more traditional red/green/blue clusters, or blue/UV LEDs with phosphors.

In one particular implementation, a processing stage 38 monitors, e.g., the average intensity, redness, greenness or other coloration of the image data contained in the bodies of packets. This intensity data can be applied to an output 33 of that stage. With the image data, each packet can convey a timestamp indicating the particular time (absolute, or based on a local clock) at which the image data was captured. This time data, too, can be provided on output 33.

A synchronization processor 35 coupled to such an output 33 can examine the variation in frame-to-frame intensity (or color), as a function of timestamp data, to discern its periodicity. Moreover, this module can predict the next time instant at which the intensity (or color) will have a maximum, minimum, or other particular state. A phase-locked loop may control an oscillator that is synced to mirror the periodicity of an aspect of the illumination. More typically, a digital filter computes a time interval that is used to set or compare against timers, optionally with software interrupts. A digital phase-locked loop or delay-locked loop can also be used. (A Kalman filter is commonly used for this type of phase locking.)
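
As a rough illustration of discerning periodicity from timestamped intensity data, the sketch below locates local maxima, averages the spacing between them, and extrapolates to the next expected maximum. It is a deliberately simple stand-in for the phase-locked loop or digital filter mentioned above, and it assumes the samples are dense and clean enough that local maxima are meaningful. All function names are hypothetical.

```python
import math


def find_peak_times(samples):
    """samples: list of (timestamp, intensity). Return timestamps of local maxima."""
    peaks = []
    for i in range(1, len(samples) - 1):
        value = samples[i][1]
        if value > samples[i - 1][1] and value >= samples[i + 1][1]:
            peaks.append(samples[i][0])
    return peaks


def estimate_period(samples):
    """Average spacing between successive maxima, or None if too few peaks."""
    peaks = find_peak_times(samples)
    if len(peaks) >= 2:
        return (peaks[-1] - peaks[0]) / (len(peaks) - 1)
    return None


def predict_next_maximum(samples, now):
    """Extrapolate from the last observed peak to the first peak after `now`."""
    period = estimate_period(samples)
    if period is None:
        return None
    t = find_peak_times(samples)[-1]
    while now >= t:
        t += period
    return t


# Synthetic flicker with a 10 ms period, sampled once per millisecond.
samples = [(t, math.cos(2 * math.pi * t / 10)) for t in range(51)]
print(estimate_period(samples), predict_next_maximum(samples, now=47))
```

Real frame rates are usually far lower than the flicker frequency, so a practical implementation would have to account for aliasing and noise; a tracking filter of the kind named in the text handles that more gracefully than peak picking.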

Control processor module 36 can poll the synchronization module 35 to determine when a lighting condition is expected to have a desired state. With this information, control processor module 36 can direct setup module 34 to capture a frame of data under favorable lighting conditions for a particular purpose. For example, if the camera is imaging an object suspected of having a digital watermark encoded in a green color channel, processor 36 may direct camera 32 to capture a frame of imagery at an instant that green illumination is expected to be at a maximum, and direct processing stages 38 to process that frame for detection of such a watermark.

The camera phone may be equipped with plural LED light sources that are usually operated in tandem to produce a flash of white light illumination on a subject. Operated individually or in different combinations, however, they can cast different colors of light on the subject. The phone processor may control the component LED sources individually, to capture frames with non-white illumination. If capturing an image that is to be read to decode a green-channel watermark, only green illumination may be applied when the frame is captured. Or a camera may capture plural successive frames, with different LEDs illuminating the subject. One frame may be captured at a 1/250th second exposure with a corresponding period of red-only illumination; a subsequent frame may be captured at a 1/100th second exposure with a corresponding period of green-only illumination, etc. These frames may be analyzed separately, or may be combined, e.g., for analysis in the aggregate. Or a single frame of imagery may be captured over an interval of 1/100th of a second, with the green LED activated for that entire interval, and the red LED activated for 1/250th of a second during that 1/100th second interval. The instantaneous ambient illumination can be sensed (or predicted, as above), and the component LED colored light sources can be operated in a responsive manner (e.g., to counteract orangeness of tungsten illumination by adding blue illumination from a blue LED).

Other Notes; Projectors

While a packet-based, data driven architecture is shown in FIG. 16, a variety of other implementations are of course possible. Such alternative architectures are straightforward to the artisan, based on the details given.

The artisan will appreciate that the arrangements and details noted above are arbitrary. Actual choices of arrangement and detail will depend on the particular application being served, and most likely will be different than those noted. (To cite but a trivial example, FFTs need not be performed on 16×16 blocks, but can be done on 64×64, 256×256, the whole image, etc.)

Similarly, it will be recognized that the body of a packet can convey an entire frame of data, or just excerpts (e.g., a 128×128 block). Image data from a single captured frame may thus span a series of several packets. Different excerpts within a common frame may be processed differently, depending on the packet with which they are conveyed.

Moreover, a processing stage 38 may be instructed to break a packet into multiple packets, such as by splitting image data into 16 tiled smaller sub-images. Thus, more packets may be present at the end of the system than were produced at the beginning.
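
A stage that fans one packet out into 16 could be sketched as below. The packet representation (a dictionary with `origin` and `body` keys) is an assumption made for illustration; the specification does not prescribe a packet layout at this level.

```python
def split_into_tiles(image, rows=4, cols=4):
    """Split a 2D image (list of rows) into rows*cols tile packets.

    Each output packet records where its tile came from so that later
    stages can reassemble or localize results. The dict layout is
    hypothetical.
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image dimensions must divide evenly into tiles")
    tile_h, tile_w = height // rows, width // cols
    packets = []
    for r in range(rows):
        for c in range(cols):
            body = [row[c * tile_w:(c + 1) * tile_w]
                    for row in image[r * tile_h:(r + 1) * tile_h]]
            packets.append({"origin": (r * tile_h, c * tile_w), "body": body})
    return packets


image = [[y * 8 + x for x in range(8)] for y in range(8)]
packets = split_into_tiles(image)
print(len(packets), packets[0])
```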

In like fashion, a single packet may contain a collection of data from a series of different images (e.g., images taken sequentially with different focus, aperture, or shutter settings; a particular example is a set of focus regions from five images taken with focus bracketing, or depth of field bracketing, whether overlapping, abutting, or disjoint). This set of data may then be processed by later stages, either as a set, or through a process that selects one or more excerpts of the packet payload that meet specified criteria (e.g., a focus sharpness metric).

In the particular example detailed, each processing stage 38 generally substituted the result of its processing for the data originally received in the body of the packet. In other arrangements this need not be the case. For example, a stage may output a result of its processing to a module outside the depicted processing chain, e.g., on an output 33. (Or, as noted, a stage may maintain, in the body of the output packet, the data originally received, and augment it with further data, such as the result(s) of its processing.)

Reference was made to determining focus by reference to DCT frequency spectra, or edge detected data. Many consumer cameras perform a simpler form of focus check, simply by determining the intensity difference (contrast) between pairs of adjacent pixels. This difference peaks with correct focus. Such an arrangement can naturally be used in the detailed arrangements. (Again, advantages can accrue from performing such processing on the sensor chip.)
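
The adjacent-pixel contrast check can be sketched in a few lines. Here a frame is assumed to be a list of rows of luminance values, only horizontal neighbors are compared, and the helper names are hypothetical. Given several frames captured at different lens positions, the one with the highest score is taken as the best focused.

```python
def contrast_score(frame):
    """Sum of absolute intensity differences between horizontal neighbors."""
    total = 0
    for row in frame:
        for left, right in zip(row, row[1:]):
            total += abs(left - right)
    return total


def sharpest_index(frames):
    """Index of the frame whose contrast score is highest."""
    return max(range(len(frames)), key=lambda i: contrast_score(frames[i]))


blurred = [[100, 110, 120, 130]]   # soft gradient: low contrast
focused = [[0, 255, 0, 255]]       # crisp edges: high contrast
print(contrast_score(blurred), sharpest_index([blurred, focused]))
```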

Each stage typically conducts a handshaking exchange with an adjoining stage each time data is passed to or received from the adjoining stage. Such handshaking is routine to the artisan familiar with digital system design, so is not belabored here.

The detailed arrangements contemplated a single image sensor. However, in other embodiments, multiple image sensors can be used. In addition to enabling conventional stereoscopic processing, two or more image sensors enable or enhance many other operations.

One function that benefits from multiple cameras is distinguishing objects. To cite a simple example, a single camera is unable to distinguish a human face from a picture of a face (e.g., as may be found in a magazine, on a billboard, or on an electronic display screen). With spaced-apart sensors, in contrast, the 3D aspect of the picture can readily be discerned, allowing a picture to be distinguished from a person. (Depending on the implementation, it may be the 3D aspect of the person that is actually discerned.)

Another function that benefits from multiple cameras is refinement of geolocation. From differences between two images, a processor can determine the device's distance from landmarks whose location may be precisely known. This allows refinement of other geolocation data available to the device (e.g., by WiFi node identification, GPS, etc.).

Just as a cell phone may have one, two (or more) sensors, such a device may also have one, two (or more) projectors. Individual projectors are being deployed in cell phones by CKing (the N70 model, distributed by ChinaVision) and Samsung (the MPB200). LG and others have shown prototypes. (These projectors are understood to use Texas Instruments electronically-steerable digital micro-mirror arrays, in conjunction with LED or laser illumination.) Microvision offers the PicoP Display Engine, which can be integrated into a variety of devices to yield projector capability, using a micro-electro-mechanical scanning mirror (in conjunction with laser sources and an optical combiner). Other suitable projection technologies include 3M's liquid crystal on silicon (LCOS) and Displaytech's ferroelectric LCOS systems.

Use of two projectors, or two cameras, gives differentials of projection or viewing, providing additional information about the subject. In addition to stereo features, it also enables regional image correction. For example, consider two cameras imaging a digitally watermarked object. One camera's view of the object gives one measure of a transform that can be discerned from the object's surface (e.g., by encoded calibration signals). This information can be used to correct a view of the object by the other camera. And vice versa. The two cameras can iterate, yielding a comprehensive characterization of the object surface. (One camera may view a better-illuminated region of the surface, or see some edges that the other camera can't see. One view may thus reveal information that the other does not.)

If a reference pattern (e.g., a grid) is projected onto a surface, the shape of the surface is revealed by distortions of the pattern. The FIG. 16 architecture can be expanded to include a projector, which projects a pattern onto an object, for capture by the camera system. (Operation of the projector can be synchronized with operation of the camera, e.g., by control processor module 36, with the projector activated only as necessary, since it imposes a significant battery drain.) Processing of the resulting image by modules 38 (local or remote) provides information about the surface topology of the object. This 3D topology information can be used as a clue in identifying the object.

In addition to providing information about the 3D configuration of an object, shape information allows a surface to be virtually re-mapped to any other configuration, e.g., flat. Such remapping serves as a sort of normalization operation.

In one particular arrangement, system 30 operates a projector to project a reference pattern into the camera's field of view. While the pattern is being projected, the camera captures a frame of image data. The resulting image is processed to detect the reference pattern, and therefrom characterize the 3D shape of an imaged object. Subsequent processing then follows, based on the 3D shape data.

(In connection with such arrangements, the reader is referred to the Google book-scanning patent, U.S. Pat. No. 7,508,978, which employs related principles. That patent details a particularly useful reference pattern, among other relevant disclosures.)

If the projector uses collimated laser illumination (such as the PicoP Display Engine), the pattern will be in focus regardless of distance to the object onto which the pattern is projected. This can be used as an aid to adjust focus of a cell phone camera onto an arbitrary subject. Because the projected pattern is known in advance by the camera, the captured image data can be processed to optimize detection of the pattern, such as by correlation. (Or the pattern can be selected to facilitate detection, such as a checkerboard that appears strongly at a single frequency in the image frequency domain when properly focused.) Once the camera is adjusted for optimum focus of the known, collimated pattern, the projected pattern can be discontinued, and the camera can then capture a properly focused image of the underlying subject onto which the pattern was projected.

Synchronous detection can also be employed. The pattern may be projected during capture of one frame, and then turned off for capture of the next. The two frames can then be subtracted. The common imagery in the two frames generally cancels, leaving the projected pattern at a much higher signal to noise ratio.
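
The frame subtraction at the heart of this approach is simple enough to show directly. The sketch assumes two aligned frames of equal size, represented as lists of rows of luminance values, with the scene unchanged between captures; the function name is hypothetical.

```python
def subtract_frames(pattern_on, pattern_off):
    """Pixelwise difference, clamped at zero, isolating the projected pattern."""
    return [
        [max(lit - dark, 0) for lit, dark in zip(row_on, row_off)]
        for row_on, row_off in zip(pattern_on, pattern_off)
    ]


scene = [[50, 60], [70, 80]]       # ambient imagery common to both frames
pattern = [[0, 100], [100, 0]]     # light added by the projector
lit = [[s + p for s, p in zip(rs, rp)] for rs, rp in zip(scene, pattern)]

print(subtract_frames(lit, scene))
```

In practice the two frames differ slightly through sensor noise and subject or camera motion, so the cancellation is only approximate; the pattern nonetheless stands out far more clearly in the difference than in either frame alone.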

A projected pattern can be used to determine correct focus for several subjects in the camera's field of view. A child may pose in front of the Grand Canyon. The laser-projected pattern allows the camera to focus on the child in a first frame, and on the background in a second frame. These frames can then be composited, taking from each the portion properly in focus.

If a lens arrangement is used in the cell phone's projector system, it can also be used for the cell phone's camera system. A mirror can be controllably moved to steer the camera or the projector to the lens. Or a beam-splitter arrangement 80 can be used (FIG. 20). Here the body of a cell phone 81 incorporates a lens 82, which provides light to a beam-splitter 84. Part of the illumination is routed to the camera sensor 12. The other part of the optical path goes to a micro-mirror projector system 86.

Lenses used in cell phone projectors typically are larger aperture than those used for cell phone cameras, so the camera may gain significant performance advantages (e.g., enabling shorter exposures) by use of such a shared lens. Or, reciprocally, the beam-splitter 84 can be asymmetrical, not equally favoring both optical paths. For example, the beam-splitter can be a partially-silvered element that couples a smaller fraction (e.g., 2%, 8%, or 25%) of externally incident light to the sensor path 83. The beam-splitter may thus serve to couple a larger fraction (e.g., 98%, 92%, or 75%) of illumination from the micro-mirror projector externally, for projection. By this arrangement the camera sensor 12 receives light of an intensity that is conventional for a cell phone camera (notwithstanding the larger aperture lens), while the light output from the projector is only slightly dimmed by the lens sharing arrangement.

In another arrangement, a camera head is separate (or detachable) from the cell phone body. The cell phone body is carried in a user's pocket or purse, while the camera head is adapted for looking out over a user's pocket (e.g., in a form factor akin to a pen, with a pocket clip, and with a battery in the pen barrel). The two communicate by Bluetooth or other wireless arrangement, with capture instructions sent from the phone body, and image data sent from the camera head. Such configuration allows the camera to constantly survey the scene in front of the user, without requiring that the cell phone be removed from the user's pocket/purse.

In a related arrangement, a strobe light for the camera is separate (or detachable) from the cell phone body. The light (which may incorporate LEDs) can be placed near the image subject, providing illumination from a desired angle and distance. The strobe can be fired by a wireless command issued by the cell phone camera system.

(Those skilled in optical system design will recognize a number of alternatives to the arrangements particularly noted.)

Some of the advantages that accrue from having two cameras can be realized by having two projectors (with a single camera). For example, the two projectors can project alternating or otherwise distinguishable patterns (e.g., simultaneous, but of differing color, pattern, polarization, etc.) into the camera's field of view. By noting how the two patterns, projected from different points, differ when presented on an object and viewed by the camera, stereoscopic information can again be discerned.

Many usage models are enabled through use of a projector, including new sharing models (cf. Greaves, “View & Share: Exploring Co-Present Viewing and Sharing of Pictures using Personal Projection,” Mobile Interaction with the Real World 2009). Such models employ the image created by the projector itself as a trigger to initiate a sharing session, either overtly through a commonly understood symbol (an “open” sign), or covertly through triggers that are machine readable. Sharing can also occur through ad hoc networks utilizing peer to peer applications, or a server hosted application.

Other output from mobile devices can be similarly shared. Consider keyvectors. One user's phone may process an image with Hough transform and other eigenface extraction techniques, and then share the resulting keyvector of eigenface data with others in the user's social circle (either by pushing same to them, or allowing them to pull it). One or more of these socially-affiliated devices may then perform facial template matching that yields an identification of a formerly-unrecognized face in the imagery captured by the original user. Such arrangement takes a personal experience, and makes it a public experience. Moreover, the experience can become a viral experience, with the keyvector data shared, essentially without bounds, to a great number of further users.

Selected Other Arrangements

In addition to the arrangements earlier detailed, another hardware arrangement suitable for use with the present technology uses the Mali-400 ARM graphics multiprocessor architecture, which includes plural fragment processors that can be devoted to the different types of image processing tasks referenced in this document.

The standards group Khronos has issued OpenGL ES2.0, which defines hundreds of standardized graphics function calls for systems that include multiple CPUs and multiple GPUs (a direction in which cell phones are increasingly migrating). OpenGL ES2.0 attends to routing of different operations to different of the processing units, with such details being transparent to the application software. It thus provides a consistent software API usable with all manner of GPU/CPU hardware.

In accordance with another aspect of the present technology, the OpenGL ES2.0 standard is extended to provide a standardized graphics processing library not just across different CPU/GPU hardware, but also across different cloud processing hardware, again with such details being transparent to the calling software.

Increasingly, Java Specification Requests (JSRs) have been defined to standardize certain Java-implemented tasks. JSRs increasingly are designed for efficient implementations on top of OpenGL ES2.0 class hardware.

In accordance with a still further aspect of the present technology, some or all of the image processing operations noted in this specification (facial recognition, SIFT processing, watermark detection, histogram processing, etc.) can be implemented as JSRs, providing standardized implementations that are suitable across diverse hardware platforms.

In addition to supporting cloud-based JSRs, the extended standards specification can also support the Query Router and Response Manager functionality detailed earlier, including both static and auction-based service providers.

Akin to OpenGL is OpenCV, a computer vision library available under an open source license, permitting coders to invoke a variety of functions without regard to the particular hardware that is being utilized to perform same. (An O'Reilly book, Learning OpenCV, documents the library extensively.) A counterpart, NokiaCV, provides similar functionality specialized for the Symbian operating system (e.g., Nokia cell phones).

OpenCV provides support for a large variety of operations, including high level tasks such as facial recognition, gesture recognition, motion tracking/understanding, segmentation, etc., as well as an extensive assortment of more atomic, elemental vision/image processing operations.

CMVision is another package of computer vision tools that can be employed in embodiments of the present technology; this package was compiled by researchers at Carnegie Mellon University.

Still another hardware architecture makes use of a field programmable object array (FPOA) arrangement, in which hundreds of diverse 16-bit “objects” are arrayed in a gridded node fashion, with each being able to exchange data with neighboring devices through very high bandwidth channels. (The PicoChip devices referenced earlier are of this class.) The functionality of each can be reprogrammed, as with FPGAs. Again, different of the image processing tasks can be performed by different of the FPOA objects. These tasks can be redefined on the fly, as needed (e.g., an object may perform SIFT processing in one state; FFT processing in another state; log-polar processing in a further state, etc.).

(While many grid arrangements of logic devices are based on “nearest neighbor” interconnects, additional flexibility can be achieved by use of a “partial crossbar” interconnect. See, e.g., U.S. Pat. No. 5,448,496 (Quickturn Design Systems).)

Also in the realm of hardware, certain embodiments of the present technology employ “extended depth of field” imaging systems (see, e.g., U.S. Pat. Nos. 7,218,448, 7,031,054 and 5,748,371). Such arrangements include a mask in the imaging path that modifies the optical transfer function of the system so as to be insensitive to the distance between the object and the imaging system. The image quality is then uniformly poor over the depth of field. Digital post processing of the image compensates for the mask modifications, restoring image quality, but retaining the increased depth of field. Using such technology, the cell phone camera can capture imagery having both nearer and further subjects all in focus (i.e., with greater high frequency detail), without requiring longer exposures, as would normally be required. (Longer exposures exacerbate problems such as hand-jitter, and moving subjects.) In the arrangements detailed here, shorter exposures allow higher quality imagery to be provided to image processing functions without enduring the temporal delay created by optical/mechanical focusing elements, or requiring input from the user as to which elements of the image should be in focus. This provides for a much more intuitive experience, as the user can simply point the imaging device at the desired target without worrying about focus or depth of field settings. Similarly, the image processing functions are able to leverage all the pixels included in the image/frame captured, as all are expected to be in-focus. In addition, new metadata regarding identified objects or groupings of pixels related to depth within the frame can produce simple “depth map” information, setting the stage for 3D video capture and storage of video streams using emerging standards on transmission of depth information.

In some embodiments the cell phone may have the capability to perform a given operation locally, but may decide instead to have it performed by a cloud resource. The decision of whether to process locally or remotely can be based on “costs,” including bandwidth costs, external service provider costs, power costs to the cell phone battery, intangible costs in consumer (dis-)satisfaction by delaying processing, etc. For example, if the user is running low on battery power, and is at a location far from a cell tower (so that the cell phone runs its RF amplifier at maximum output when transmitting), then sending a large block of data for remote processing may consume a significant fraction of the battery's remaining life. In such case, the phone may decide to process the data locally, or to forward it for remote processing when the phone is closer to the cell site or the battery has been recharged. A set of stored rules can be applied to the relevant variables to establish a net “cost function” for different approaches (e.g., process locally, process remotely, defer processing), and these rules may indicate different outcomes depending on the states of these variables.
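
One way such stored rules might be reduced to a net cost function is sketched below. Every name and numeric weight here is a hypothetical placeholder chosen to make the example concrete; the specification does not give particular formulas. The sketch scales energy costs by remaining battery, scales transmit energy by link quality, and penalizes deferral heavily when the user is waiting on the result.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    battery: float     # remaining charge, 0.0 to 1.0
    signal: float      # link quality, 0.0 (far from tower) to 1.0
    payload_kb: float  # size of the data to be processed
    urgent: bool       # True if the user is waiting on the result


# Hypothetical weights, for illustration only.
LOCAL_ENERGY = 20.0
ENERGY_PER_KB = 0.01
PROVIDER_FEE = 1.0
DEFER_PENALTY = 30.0
URGENT_DEFER_PENALTY = 1000.0


def approach_costs(state):
    """Net cost of each approach; lower is better."""
    battery = max(state.battery, 0.05)  # clamp to avoid division by zero
    signal = max(state.signal, 0.05)
    transmit_energy = state.payload_kb * ENERGY_PER_KB / signal
    return {
        "local": LOCAL_ENERGY / battery,
        "remote": transmit_energy / battery + PROVIDER_FEE,
        "defer": URGENT_DEFER_PENALTY if state.urgent else DEFER_PENALTY,
    }


def choose_approach(state):
    costs = approach_costs(state)
    return min(costs, key=costs.get)


print(choose_approach(DeviceState(0.9, 0.9, 500, urgent=False)))
print(choose_approach(DeviceState(0.1, 0.1, 500, urgent=False)))
print(choose_approach(DeviceState(0.1, 0.1, 500, urgent=True)))
```

With these placeholder weights, a well-charged phone with a good link sends the data out, a nearly drained phone far from a tower defers a non-urgent task, and the same phone processes an urgent task locally, mirroring the outcomes described in the paragraph above.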

An appealing “cloud” resource is the processing capability found at the edges of wireless networks. Cellular networks, for example, include tower stations that are, in large part, software-defined radios, employing processors to perform digitally some or all of the operations traditionally performed by analog transmitting and receiving radio circuits, such as mixers, filters, demodulators, etc. Even smaller cell stations, so-called “femtocells,” typically have powerful signal processing hardware for such purposes. The PicoChip processors noted earlier, and other field programmable object arrays, are widely deployed in such applications.

Radio signal processing, and image signal processing, have many commonalities, e.g., employing FFT processing to convert sampled data to the frequency domain, applying various filtering operations, etc. Cell station equipment, including processors, is designed to meet peak consumer demands. This means that significant processing capability is often left unused.

In accordance with another aspect of the present technology, this spare radio signal processing capability at cellular tower stations (and other edges of wireless networks) is repurposed in connection with image (and/or audio or other) signal processing for consumer wireless devices. Since an FFT operation is the same, whether processing sampled radio signals or image pixels, the repurposing is often straightforward: configuration data for the hardware processing cores needn't be changed much, if at all. And because 3G/4G networks are so fast, a processing task can be delegated quickly from a consumer device to a cell station processor, and the results returned with similar speed. In addition to the speed and computational muscle that such repurposing of cell station processors affords, another benefit is reducing the power consumption of the consumer devices.

Before sending image data for processing, a cell phone can quickly inquire of the cell tower station with which it is communicating to confirm that it has enough unused capacity to undertake the intended image processing operation. This query can be sent by the packager/router of FIG. 10, the local/remote router of FIG. 10A, the query router and response manager of FIG. 7, the pipe manager 51 of FIG. 16, etc.

Alerting the cell tower/base station of forthcoming processing requests, and/or bandwidth requirements, allows the cell site to better allocate its processing and bandwidth resources in anticipation of meeting such needs.

Cell sites are at risk of becoming bottlenecked: undertaking service operations that exhaust their processing or bandwidth capacity. When this occurs, they must triage by unexpectedly throttling back the processing/bandwidth provided to one or more users, so others can be served. This sudden change in service is undesirable, since changing the parameters with which the channel was originally established (e.g., the bit rate at which video can be delivered) forces data services using the channel to reconfigure their respective parameters (e.g., requiring ESPN to provide a lower quality video feed). Renegotiating such details once the channel and services have been set up invariably causes glitches, e.g., video delivery stuttering, dropped syllables in phone calls, etc.

To avoid the need for these unpredictable bandwidth slowdowns, and resulting service impairments, cell sites tend to adopt a conservative strategy: allocating bandwidth/processing resources parsimoniously, in order to reserve capacity for possible peak demands. But this approach impairs the quality of service that might otherwise be normally provided, sacrificing typical service in anticipation of the unexpected.

In accordance with this aspect of the present technology, a cell phone sends alerts to the cell tower station, specifying bandwidth or processing needs that it anticipates will be forthcoming. In effect, the cell phone asks to reserve a bit of future service capacity. The tower station still has a fixed capacity. However, knowing that a particular user will be needing, e.g., a bandwidth of 8 Mbit/s for 3 seconds, commencing in 200 milliseconds, allows the cell site to take such anticipated demand into account as it serves other users.

Consider a cell site having an excess (allocable) channel capacity of 15 Mbit/s, which normally allocates to a new video service user a channel of 10 Mbit/s. If the site knows that a cell camera user has requested reservation for an 8 Mbit/s channel starting in 200 milliseconds, and a new video service user meanwhile requests service, the site may allocate the new video service user a channel of 7 Mbit/s, rather than the usual 10 Mbit/s. By initially setting up the new video service user's channel at the slower bit rate, service impairments associated with cutting back bandwidth during an ongoing channel session are avoided. The capacity of the cell site is the same, but it is now allocated in a manner that reduces the need for reducing the bandwidth of existing channels, mid-transmission.
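The arithmetic of this example can be sketched as follows. The function name, its parameters, and the notion of a planning horizon are assumptions made for illustration; the sketch simply subtracts every reservation due to start within the horizon from the spare capacity, without modeling when each reservation ends.

```python
def initial_channel_rate(spare_mbps, usual_rate_mbps, reservations, horizon_ms):
    """Rate (Mbit/s) to grant a new channel, given pending reservations.

    reservations: list of (start_in_ms, rate_mbps) tuples (hypothetical format).
    Reservations starting within horizon_ms are treated as already committed.
    """
    reserved = sum(rate for start_ms, rate in reservations if start_ms <= horizon_ms)
    # Never grant more than the usual rate, and never a negative rate.
    return max(0.0, min(usual_rate_mbps, spare_mbps - reserved))


# The example from the text: 15 Mbit/s spare, 10 Mbit/s usual allocation,
# one 8 Mbit/s reservation commencing in 200 milliseconds.
rate = initial_channel_rate(15.0, 10.0, [(200, 8.0)], horizon_ms=1000)
```

Here the new video service user is set up at 7 Mbit/s from the outset; absent the reservation, the same call would grant the usual 10 Mbit/s.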

In another situation, the cell site may determine that it has excess capacity at present, but expects to be more heavily burdened in a half second. In this case it may use the present excess capacity to speed throughput to one or more video subscribers, e.g., those for whom it has collected several packets of video data in a buffer memory, ready for delivery. These video packets may be sent through the enlarged channel now, in anticipation that the video channel will be slowed in a half second. Again, this is practical because the cell site has useful information about future bandwidth demands.

The service reservation message sent from the cell phone may also include a priority indicator. This indicator can be used by the cell site to determine the relative importance of meeting the request on the stated terms, in case arbitration between conflicting service demands is required.

Such anticipatory service requests from cell phones can also allow the cell site to provide higher quality sustained service than would normally be allocated.

Cell sites are understood to employ statistical models of usage patterns, and allocate bandwidth accordingly. The allocations are typically set conservatively, in anticipation of realistic worst case usage scenarios, e.g., encompassing scenarios that occur 99.99% of the time. (Some theoretically possible scenarios are sufficiently improbable that they may be disregarded in bandwidth allocations. However, on the rare occasions when such improbable scenarios occur, as when thousands of subscribers sent cell phone picture messages from Washington DC during the Obama inauguration, some subscribers may simply not receive service.)

The statistical models on which site bandwidth allocations are based are understood to treat subscribers, in part, as unpredictable actors. Whether a particular subscriber requests service in the forthcoming seconds (and what particular service is requested) has a random aspect.

The larger the randomness in a statistical model, the larger the extremes tend to be. If reservations, or forecasts of future demands, are routinely submitted by, e.g., 15% of subscribers, then the behavior of those subscribers is no longer random. The worst case peak bandwidth demand on a cell site does not involve 100% of the subscribers acting randomly, but only 85%. Actual reservation information can be employed for the other 15%. Hypothetical extremes in peak bandwidth usage are thus moderated.

With lower peak usage scenarios, more generous allocations of present bandwidth can be granted to all subscribers. That is, if a portion of the user base sends alerts to the site reserving future capacity, then the site may predict that the realistic peak demand that may be forthcoming will still leave the site with unused capacity. In this case it may grant a camera cell phone user a 12 Mbit/s channel instead of the 8 Mbit/s channel stated in the reservation request, and/or may grant a video user a 15 Mbit/s channel instead of the normal 10 Mbit/s channel. Such usage forecasting can thus allow the site to grant higher quality services than would normally be the case, since bandwidth reserves need be held for a lesser number of unpredictable actors.

Anticipatory service requests can also be communicated from the cell phone (or the cell site) to other cloud processes that are expected to be involved in the requested services, allowing them to similarly allocate their resources anticipatorily. Such anticipatory service requests may also serve to alert the cloud process to pre-warm associated processing. Additional information may be provided from the cell phone, or elsewhere, for this purpose, such as encryption keys, image dimensions (e.g., to configure a cloud FPOA to serve as an FFT processor for a 1024×768 image, to be processed in 16×16 tiles, and output coefficients for 32 spectral frequency bands), etc.

In turn, the cloud resource may alert the cell phone of any information it expects might be requested from the phone in performance of the expected operation, or action it might request the cell phone to perform, so that the cell phone can similarly anticipate its own forthcoming actions and prepare accordingly. For example, the cloud process may, under certain conditions, request a further set of input data, such as if it assesses that data originally provided is not sufficient for the intended purpose (e.g., the input data may be an image without sufficient focus resolution, or not enough contrast, or needing further filtering). Knowing, in advance, that the cloud process may request such further data can allow the cell phone to consider this possibility in its own operation, e.g., keeping processing modules configured in a certain filter manner longer than may otherwise be the case, reserving an interval of sensor time to possibly capture a replacement image, etc.

Anticipatory service requests (or the possibility of conditional service requests) generally relate to events that may commence in a few tens or hundreds of milliseconds, occasionally in a few seconds. Situations in which the action will commence tens or hundreds of seconds in the future will be rare. However, while the period of advance warning may be short, significant advantages can be derived: if the randomness of the next second is reduced, each second, then system randomness can be reduced considerably. Moreover, the events to which the requests relate can, themselves, be of longer duration, such as transmission of a large image file, which may take ten seconds or more.

Regarding advance set-up (pre-warming), desirably any operation that takes more than a threshold interval of time to complete (e.g., a few hundred microseconds, a millisecond, ten microseconds, etc., depending on implementation) should be prepped anticipatorily, if possible. (In some instances, of course, the anticipated service is never requested, in which case such preparation may be for naught.)

In another hardware arrangement, the cell phone processor may selectively activate a Peltier device or other thermoelectric cooler coupled to the image sensor, in circumstances when thermal image noise (Johnson noise) is a potential problem. For example, if a cell phone detects a low light condition, it may activate a cooler on the sensor to try to enhance the image signal to noise ratio. Or the image processing stages can examine captured imagery for artifacts associated with thermal noise, and if such artifacts exceed a threshold, then the cooling device can be activated. (One approach captures a patch of imagery, such as a 16×16 pixel region, twice in quick succession. Absent random factors, the two patches should be identical, i.e., perfectly correlated. The deviation of the correlation from 1.0 is a measure of noise, presumably thermal noise.) A short interval after the cooling device is activated, a substitute image can be captured, the interval depending on thermal response time for the cooler/sensor. Likewise if cell phone video is captured, a cooler may be activated, since the increased switching activity by circuitry on the sensor increases its temperature, and thus its thermal noise. (Whether to activate a cooler can also be application dependent, e.g., the cooler may be activated when capturing imagery from which watermark data may be read, but not activated when capturing imagery from which barcode data may be read.)
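The two-patch noise test lends itself to a short sketch. The following Python assumes each 16×16 patch is supplied as a flat list of 256 pixel values, uses the Pearson correlation coefficient as the correlation measure, and applies an arbitrary threshold of 0.05; the function names and the threshold are illustrative assumptions only.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists of pixel values."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A perfectly flat patch has no defined correlation; treat identical
        # patches as fully correlated and anything else as uncorrelated.
        return 1.0 if a == b else 0.0
    return cov / math.sqrt(var_a * var_b)


def should_activate_cooler(patch1, patch2, threshold=0.05):
    """True if the shortfall of the correlation from 1.0 exceeds the threshold."""
    return (1.0 - correlation(patch1, patch2)) > threshold


# Two captures of the same 16x16 region, the second perturbed by noise.
first = [i % 256 for i in range(256)]
second = [p + (60 if i % 2 == 0 else -60) for i, p in enumerate(first)]
```

In this sketch, comparing `first` with itself leaves the cooler off, while comparing `first` with the perturbed `second` capture trips the threshold.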

As noted, packets in the FIG. 16 arrangement can convey a variety of instructions and data, in both the header and the packet body. In a further arrangement a packet can additionally, or alternatively, contain a pointer to a cloud object, or to a record in a database. The cloud object/database record may contain information such as object properties, useful for object recognition (e.g., fingerprint or watermark properties for a particular object).

If the system has read a watermark, the packet may contain the watermark payload, and the header (or body) may contain one or more database references where that payload can be associated with related information. A watermark payload read from a business card may be looked-up in one database; a watermark decoded from a photograph may be looked-up in another database, etc. A system may apply multiple different watermark decoding algorithms to a single image (e.g., MediaSec, Digimarc ImageBridge, Civolution, etc.). Depending on which application performed a particular decoding operation, the resulting watermark payload may be sent off to a corresponding destination database. (Likewise with different barcodes, fingerprint algorithms, eigenface technologies, etc.) The destination database address can be included in the application, or in configuration data. (Commonly, the addressing is performed indirectly, with an intermediate data store containing the address of the ultimate database, permitting relocation of the database without changing each cell phone application.)

The system may perform an FFT on captured image data to obtain frequency domain information, and then feed that information to several watermark decoders operating in parallel, each applying a different decoding algorithm. When one of the applications extracts valid watermark data (e.g., indicated by ECC information computed from the payload), the data is sent to a database corresponding to that format/technology of watermark. Plural such database pointers can be included in a packet, and used conditionally, depending on which watermark decoding operation (or barcode reading operation, or fingerprint calculation, etc.) yields useful data.

Similarly, the system may send a facial image to an intermediary cloud service, in a packet containing an identifier of the user (but not containing the user's Apple iPhoto, or Picasa, or Facebook user name). The intermediary cloud service can take the provided user identifier, and use it to access a database record from which the user's names on these other services are obtained. The intermediary cloud service can then route the facial image data to Apple's server with the user's iPhoto user name; to Picasa's service with the user's Google user name; and to Facebook's server with the user's Facebook user name. Those respective services can then perform facial recognition on the imagery, and return the names of persons identified from the user's iPhoto/Picasa/Facebook accounts (directly to the user, or through the intermediary service). The intermediary cloud service, which may serve large numbers of users, can keep informed of the current addresses for relevant servers (and alternate proximate servers, in case the user is away from home), rather than have each cell phone try to keep such data in updated fashion.

Facial recognition applications can be used not just to identify persons, but also to identify relationships between individuals depicted in imagery. For example, data maintained by iPhoto/Picasa/Facebook may contain not just facial recognition features, and associated names, but also terms indicating relationships between the named faces and the account owner (e.g., father, boyfriend, sibling, pet, roommate, etc.). Thus, instead of simply searching a user's image collection for, e.g., all pictures of “David Smith,” the user's collection may also be searched for all pictures depicting “sibling.”

The application software in which photos are reviewed can present differently colored frames around different recognized faces, in accordance with associated relationship data (e.g., blue for siblings, red for boyfriends, etc.).

In some arrangements, the user's system can access such information stored in accounts maintained by the user's network “friends.” A face that may not be recognized by facial recognition data associated with the user's account at Picasa, may be recognized by consulting Picasa facial recognition data associated with the account of the user's friend “David Smith.” Relationship data indicated by David Smith's account can be similarly used to present, and organize, the user's photos. The earlier unrecognized face may thus be labeled with indicia indicating the person is David Smith's roommate. This essentially remaps the relationship information (e.g., mapping “roommate,” as indicated in David Smith's account, to “David Smith's roommate” in the user's account).

The embodiments detailed above were generally described in the context of a single network. However, plural networks may commonly be available to a user's phone (e.g., WiFi, Bluetooth, possibly different cellular networks, etc.). The user may choose between these alternatives, or the system may apply stored rules to automatically do so. In some instances, a service request may be issued (or results returned) across several networks in parallel.

Reference Platform Architecture

The hardware in cell phones was originally introduced for specific purposes. The microphone, for example, was used only for voice transmission over the cellular network: feeding an A/D converter that fed a modulator in the phone's radio transceiver. The camera was used only to capture snapshots. Etc.

As additional applications arose employing such hardware, each application needed to develop its own way to talk to the hardware. Diverse software stacks arose, each specialized so a particular application could interact with a particular piece of hardware. This poses an impediment to application development.

This problem compounds when cloud services and/or specialized processors are added to the mix.

To alleviate such difficulties, embodiments of the present technology can employ an intermediate software layer that provides a standard interface with which and through which hardware and software can interact. Such an arrangement is shown in FIG. 20A, with the intermediate software layer being labeled “Reference Platform.”

In this diagram hardware elements are shown in dashed boxes, including processing hardware on the bottom, and peripherals on the left. The box “IC HW” is “intuitive computing hardware,” and comprises the earlier-discussed hardware that supports the different processing of image related data, such as modules 38 in FIG. 16, the configurable hardware of FIG. 6, etc. DSP is a general purpose digital signal processor, which can be configured to perform specialized operations; CPU is the phone's primary processor; GPU is a graphics processor unit. OpenCL and OpenGL are APIs through which graphics processing services (performed on the CPU and/or GPU) can be invoked.

Different specialized technologies are in the middle, such as one or more digital watermark decoders (and/or encoders), barcode reading software, optical character recognition software, etc. Cloud services are shown on the right, and applications are on the top.

The reference platform establishes a standard interface through which different applications can interact with hardware, exchange information, and request services (e.g., by API calls). Similarly, the platform establishes a standard interface through which the different technologies can be accessed, and through which they can send and receive data to other of the system components. Likewise with the cloud services, for which the reference platform may also attend to details of identifying a service provider, whether by reverse auction, heuristics, etc. In cases where a service is available both from a technology in the cell phone, and from a remote service provider, the reference platform may also attend to weighing the costs and benefits of the different options, and deciding which should handle a particular service request.

By such arrangement, the different system components do not need to concern themselves with the details of other parts of the system. An application may call for the system to read text from an object in front of the cell phone. It needn't concern itself with the particular control parameters of the image sensor, nor the image format requirements of the OCR engine. An application may call for a read of the emotion of a person in front of the cell phone. A corresponding call is passed to whatever technology in the phone supports such functionality, and the results are returned in a standardized form. When an improved technology becomes available, it can be added to the phone, and through the reference platform the system takes advantage of its enhanced capabilities. Thus, growing/changing collections of sensors, and growing/evolving sets of service providers, can be set to the tasks of deriving meaning from input stimuli (audio as well as visual, e.g., speech recognition) through use of such an adaptable architecture.

Arasan Chip Systems, Inc. offers a Mobile Industry Processor Interface UniPro Software Stack, a layered, kernel-level stack that aims to simplify integration of certain technologies into cell phones. That arrangement may be extended to provide the functionality detailed above. (The Arasan protocol is focused primarily on transport layer issues, but involves layers down to hardware drivers as well. The Mobile Industry Processor Interface Alliance is a large industry group working to advance cell phone technologies.)

Leveraging Existing Image Collections, E.g., for Metadata

Collections of publicly-available imagery and other content are becoming more prevalent. Flickr, YouTube, Photobucket (MySpace), Picasa, Zooomr, Facebook, Webshots and Google Images are just a few. Often, these resources can also serve as sources of metadata, either expressly identified as such, or inferred from data such as file names, descriptions, etc. Sometimes geo-location data is also available.

An illustrative embodiment according to one aspect of the present technology works as follows. A user captures a cell phone picture of an object, or scene, perhaps a desk telephone, as shown in FIG. 21. (The image may be acquired in other manners as well, such as transmitted from another user, or downloaded from a remote computer.)

As a preliminary operation, known image processing operations may be applied, e.g., to correct color or contrast, to perform ortho-normalization, etc. on the captured image. Known image object segmentation or classification techniques may also be used to identify an apparent subject region of the image, and isolate same for further processing.

The image data is then processed to determine characterizing features that are useful in pattern matching and recognition. Color, shape, and texture metrics are commonly used for this purpose. Images may also be grouped based on layout and eigenvectors (the latter being particularly popular for facial recognition). Many other technologies can of course be employed, as noted elsewhere in this specification.

(Uses of vector characterizations/classifications and other image/video/audio metrics in recognizing faces, imagery, video, audio and other patterns are well known and suited for use in connection with the present technology. See, e.g., patent publications 20060020630 and 20040243567 (Digimarc), 20070239756 and 20020037083 (Microsoft), 20070237364 (Fuji Photo Film), U.S. Pat. No. 7,359,889 and (Shazam), 20050180635 (Corel), U.S. Pat. Nos. 6,430,306, 6,681,032 and 20030059124 (L-1 Corp.), U.S. Pat. Nos. 7,194,752 and 7,174,293 (Iceberg), U.S. Pat. No. 7,130,466 (Cobion), U.S. Pat. No. 6,553,136 (Hewlett-Packard), and U.S. Pat. No. 6,430,307 (Matsushita), and the journal references cited at the end of this disclosure. When used in conjunction with recognition of entertainment content such as audio and video, such features are sometimes termed content “fingerprints” or “hashes.”)

After feature metrics for the image are determined, a search is conducted through one or more publicly-accessible image repositories for images with similar metrics, thereby identifying apparently similar images. (As part of its image ingest process, Flickr and other such repositories may calculate eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification data on images at the time they are uploaded by users, and collect same in an index for public search.) The search may yield the collection of apparently similar telephone images found in Flickr, depicted in FIG. 22.

Metadata is then harvested from Flickr for each of these images, and the descriptive terms are parsed and ranked by frequency of occurrence. In the depicted set of images, for example, the descriptors harvested from such operation, and their incidence of occurrence, may be as follows:

Cisco (18)

Phone (10)

Telephone (7)

VOIP (7)

IP (5)

7941 (3)

Phones (3)

Technology (3)

7960 (2)

7920 (1)

7950 (1)

Best Buy (1)

Desk (1)

Ethernet (1)

IP-phone (1)

Office (1)

Pricey (1)

Sprint (1)

Telecommunications (1)

Uninett (1)

Work (1)

From this aggregated set of inferred metadata, it may be assumed that those terms with the highest count values (e.g., those terms occurring most frequently) are the terms that most accurately characterize the user's FIG. 21 image.

The inferred metadata can be augmented or enhanced, if desired, by known image recognition/classification techniques. Such technology seeks to provide automatic recognition of objects depicted in images. For example, by recognizing a TouchTone keypad layout, and a coiled cord, such a classifier may label the FIG. 21 image using the terms Telephone and Facsimile Machine.

If not already present in the inferred metadata, the terms returned by the image classifier can be added to the list and given a count value. (An arbitrary value, e.g., 2, may be used, or a value dependent on the classifier's reported confidence in the discerned identification can be employed.)

If the classifier yields one or more terms that are already present, the position of the term(s) in the list may be elevated. One way to elevate a term's position is by increasing its count value by a percentage (e.g., 30%). Another way is to increase its count value to one greater than the next-above term that is not discerned by the image classifier. (Since the classifier returned the term “Telephone” but not the term “Cisco,” this latter approach could rank the term Telephone with a count value of “19,” one above Cisco.) A variety of other techniques for augmenting/enhancing the inferred metadata with that resulting from the image classifier are straightforward to implement.
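A minimal sketch of this merging step follows. It assumes the harvested metadata is held as a dictionary of term counts, gives terms that are new to the list the arbitrary count of 2, and, following the worked example (Telephone raised to 19, one above Cisco), elevates an already-present term to one above the highest-count term that the classifier did not return. That reading of “next-above,” like the function name and data layout, is an assumption made for illustration.

```python
def merge_classifier_terms(counts, classifier_terms, default_count=2):
    """Fold classifier output into harvested term counts; return a ranked list.

    counts: dict mapping harvested term -> count.
    classifier_terms: terms returned by the image classifier.
    """
    merged = dict(counts)
    for term in classifier_terms:
        if term not in counts:
            # New term: add it with an arbitrary count value.
            merged[term] = default_count
            continue
        # Counts of higher-ranked terms that the classifier did not return.
        above = [c for t, c in counts.items()
                 if c > counts[term] and t not in classifier_terms]
        if above:
            merged[term] = max(above) + 1
    # Rank by count, highest first (ties keep their original order).
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


harvested = {"Cisco": 18, "Phone": 10, "Telephone": 7, "VOIP": 7, "IP": 5}
ranked = merge_classifier_terms(harvested, ["Telephone", "Facsimile Machine"])
```

For the abbreviated counts shown, `ranked` begins with Telephone (19), followed by Cisco (18), Phone (10) and VOIP (7), with Facsimile Machine entering the list at a count of 2.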

A revised listing of metadata, resulting from the foregoing, may be as follows:

Telephone (19)

Cisco (18)

Phone (10)

VOIP (7)

IP (5)

7941 (3)

Phones (3)

Technology (3)

7960 (2)

Facsimile Machine (2)

7920 (1)

7950 (1)

Best Buy (1)

Desk (1)

Ethernet (1)

IP-phone (1)

Office (1)

Pricey (1)

Sprint (1)

Telecommunications (1)

Uninett (1)

Work (1)

The list of inferred metadata can be restricted to those terms that have the highest apparent reliability, e.g., count values. A subset of the list comprising, e.g., the top N terms, or the terms in the top Mth percentile of the ranked listing, may be used. This subset can be associated with the FIG. 21 image in a metadata repository for that image, as inferred metadata.

In the present example, if N=4, the terms Telephone, Cisco, Phone and VOIP are associated with the FIG. 21 image.

Once a list of metadata is assembled for the FIG. 21 image (by the foregoing procedure, or others), a variety of operations can be undertaken.

One option is to submit the metadata, along with the captured content or data derived from the captured content (e.g., the FIG. 21 image, image feature data such as eigenvectors, color histograms, keypoint descriptors, FFTs, machine readable data decoded from the image, etc.), to a service provider that acts on the submitted data, and provides a response to the user. Shazam, Snapnow (now LinkMe Mobile), ClusterMedia Labs, Snaptell (now part of Amazon's A9 search service), Mobot, Mobile Acuity, Nokia Point & Find, Kooaba, idee TinEye, iVisit's SeeScan, Evolution Robotics' ViPR, IQ Engine's oMoby, and Digimarc Mobile are a few of several commercially available services that capture media content, and provide a corresponding response; others are detailed in the earlier-cited patent publications. By accompanying the content data with the metadata, the service provider can make a more informed judgment as to how it should respond to the user's submission.

The service provider, or the user's device, can submit the metadata descriptors to one or more other services, e.g., a web search engine such as Google, to obtain a richer set of auxiliary information that may help better discern/infer/intuit an appropriate response desired by the user. Or the information obtained from Google (or other such database resource) can be used to augment/refine the response delivered by the service provider to the user. (In some cases, the metadata, possibly accompanied by the auxiliary information received from Google, can allow the service provider to produce an appropriate response to the user, without even requiring the image data.)

In some cases, one or more images obtained from Flickr may be substituted for the user's image. This may be done, for example, if a Flickr image appears to be of higher quality (using sharpness, illumination histogram, or other measures), and if the image metrics are sufficiently similar. (Similarity can be judged by a distance measure appropriate to the metrics being used. One embodiment checks whether the distance measure is below a threshold. If several alternate images pass this screen, then the closest image is used.) Or substitution may be used in other circumstances. The substituted image can then be used instead of (or in addition to) the captured image in the arrangements detailed herein.
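The screening just described might be sketched as follows. The sketch assumes each image is characterized by a numeric feature vector and a scalar quality score, and uses Euclidean distance as the distance measure; these representations, and the name pick_substitute, are illustrative assumptions rather than features of any particular service.

```python
import math


def pick_substitute(user_metrics, user_quality, candidates, max_distance):
    """Return the id of the closest higher-quality candidate, or None.

    candidates: list of (image_id, metrics, quality) tuples (hypothetical format).
    A candidate passes the screen if its distance from the user's image is
    below max_distance and its quality score exceeds the user's.
    """
    best_id = None
    best_distance = None
    for image_id, metrics, quality in candidates:
        distance = math.dist(user_metrics, metrics)
        if distance >= max_distance or quality <= user_quality:
            continue  # fails the similarity or the quality screen
        if best_distance is None or distance < best_distance:
            best_id, best_distance = image_id, distance
    return best_id


candidates = [
    ("a", [3.0, 4.0], 0.9),   # high quality, but too distant
    ("b", [1.0, 0.0], 0.8),   # passes both screens
    ("c", [0.5, 0.0], 0.2),   # closest, but lower quality than the user's image
]
choice = pick_substitute([0.0, 0.0], 0.4, candidates, max_distance=2.0)
```

In this example only image “b” is both sufficiently similar and of higher quality, so it is selected; if no candidate passes the screen, the function returns None and the user's own image would be retained.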

In one such arrangement, the substitute image data is submitted to the service provider. In another, data for several substitute images is submitted. In another, the original image data, together with one or more alternative sets of image data, is submitted. In the latter two cases, the service provider can use the redundancy to help reduce the chance of error, assuring an appropriate response is provided to the user. (Or the service provider can treat each submitted set of image data individually, and provide plural responses to the user. The client software on the cell phone can then assess the different responses, and pick between them (e.g., by a voting arrangement), or combine the responses, to help provide the user an enhanced response.)

Instead of substitution, one or more related public image(s) may be composited or merged with the user's cell phone image. The resulting hybrid image can then be used in the different contexts detailed in this disclosure.

A still further option is to use apparently-similar images gleaned from Flickr to inform enhancement of the user's image. Examples include color correction/matching, contrast correction, glare reduction, removing foreground/background objects, etc. By such arrangement, for example, such a system may discern that the FIG. 21 image has foreground components (apparently Post-It notes) on the telephone that should be masked or disregarded. The user's image data can be enhanced accordingly, and the enhanced image data used thereafter.

Relatedly, the user's image may suffer some impediment, e.g., depicting its subject from an odd perspective, or with poor lighting, etc. This impediment may cause the user's image not to be recognized by the service provider (i.e., the image data submitted by the user does not seem to match any image data in the database being searched). Either in response to such a failure, or proactively, data from similar images identified from Flickr may be submitted to the service provider as alternatives, in the hope that they might work better.

Another approach, one that opens up many further possibilities, is to search Flickr for one or more images with similar image metrics, and collect metadata as described herein (e.g., Telephone, Cisco, Phone, VOIP). Flickr is then searched a second time, based on metadata. Plural images with similar metadata can thereby be identified. Data for these further images (including images with a variety of different perspectives, different lighting, etc.) can then be submitted to the service provider, notwithstanding that they may “look” different than the user's cell phone image.

When doing metadata-based searches, identity of metadata may not be required. For example, in the second search of Flickr just-referenced, four terms of metadata may have been associated with the user's image: Telephone, Cisco, Phone and VOIP. A match may be regarded as an instance in which a subset (e.g., three) of these terms is found.

Another approach is to rank matches based on the rankings of shared metadata terms. An image tagged with Telephone and Cisco would thus be ranked as a better match than an image tagged with Phone and VOIP. One adaptive way to rank a “match” is to sum the counts for the metadata descriptors for the user's image (e.g., 19+18+10+7=54), and then tally the count values for shared terms in a Flickr image (e.g., 18+10+7=35, if the Flickr image is tagged with Cisco, Phone and VOIP). The ratio can then be computed (35/54, or about 65%) and compared to a threshold (e.g., 60%). In this case the ratio exceeds the threshold, so a “match” is found. A variety of other adaptive matching techniques can be devised by the artisan.
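The count-weighted ratio can be worked through directly with the figures given above. The following is a minimal sketch; the function name `match_ratio` and the dictionary layout are illustrative assumptions.

```python
def match_ratio(user_counts, candidate_tags):
    """Share of the user's total metadata count covered by terms on the candidate."""
    total = sum(user_counts.values())
    shared = sum(count for term, count in user_counts.items() if term in candidate_tags)
    return shared / total if total else 0.0


# Occurrence counts for the user's image, as in the example above.
user_counts = {"Telephone": 19, "Cisco": 18, "Phone": 10, "VOIP": 7}

ratio = match_ratio(user_counts, {"Cisco", "Phone", "VOIP"})  # (18 + 10 + 7) / 54
is_match = ratio >= 0.60                                      # example threshold
print(f"{ratio:.3f}", is_match)
```

An image tagged only with Phone and VOIP would score 17/54 (about 31%) and fall below the same threshold.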

The above examples searched Flickr for images based on similarity of image metrics, and optionally on similarity of textual (semantic) metadata. Geolocation data (e.g., GPS tags) can also be used to get a metadata toe-hold.

If the user captures an arty, abstract shot of the Eiffel tower from amid the metalwork or another unusual vantage point (e.g., FIG. 29), it may not be recognized—from image metrics—as the Eiffel tower. But GPS info captured with the image identifies the location of the image subject. Public databases (including Flickr) can be employed to retrieve textual metadata based on GPS descriptors. Inputting GPS descriptors for the photograph yields the textual descriptors Paris and Eiffel.

Google Images, or another database, can be queried with the terms Eiffel and Paris to retrieve other, perhaps more conventional images of the Eiffel tower. One or more of those images can be submitted to the service provider to drive its process. (Alternatively, the GPS information from the user's image can be used to search Flickr for images from the same location; yielding imagery of the Eiffel Tower that can be submitted to the service provider.)

Although GPS is gaining in camera-metadata-deployment, most imagery presently in Flickr and other public databases is missing geolocation info. But GPS info can be automatically propagated across a collection of imagery that share visible features (by image metrics such as eigenvectors, color histograms, keypoint descriptors, FFTs, or other classification techniques), or that have a metadata match.

To illustrate, if the user takes a cell phone picture of a city fountain, and the image is tagged with GPS information, it can be submitted to a process that identifies matching Flickr/Google images of that fountain on a feature-recognition basis. To each of those images the process can add GPS information from the user's image.

A second level of searching can also be employed. From the set of fountain images identified from the first search based on similarity of appearance, metadata can be harvested and ranked, as above. Flickr can then be searched a second time, for images having metadata that matches within a specified threshold (e.g., as reviewed above). To those images, too, GPS information from the user's image can be added.

Alternatively, or in addition, a first set of images in Flickr/Google similar to the user's image of the fountain can be identified—not by pattern matching, but by GPS-matching (or both). Metadata can be harvested and ranked from these GPS-matched images. Flickr can be searched a second time for a second set of images with similar metadata. To this second set of images, GPS information from the user's image can be added.
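The propagation step common to these variants can be sketched as follows. This is an illustrative outline only: the `Image` record, the `propagate_gps` helper, and the tag-overlap predicate standing in for a real feature or metadata matcher are all hypothetical. The sketch also assumes that images already carrying their own GPS data are left untouched, and that propagated coordinates are flagged as inferred.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class Image:
    name: str
    tags: Set[str]
    gps: Optional[Tuple[float, float]] = None
    gps_inferred: bool = False  # True when the GPS was copied from another image


def propagate_gps(seed, candidates, is_match):
    """Copy the seed's GPS to matching candidates that lack GPS of their own."""
    updated = []
    for image in candidates:
        if image.gps is None and is_match(seed, image):
            image.gps = seed.gps
            image.gps_inferred = True
            updated.append(image.name)
    return updated


def shares_two_tags(a, b):
    # Stand-in for a feature-recognition or metadata-threshold match.
    return len(a.tags & b.tags) >= 2


seed = Image("user_fountain", {"fountain", "plaza", "portland"}, gps=(45.5152, -122.6784))
candidates = [
    Image("flickr_1", {"fountain", "plaza"}),
    Image("flickr_2", {"fountain", "portland", "night"}, gps=(45.5150, -122.6780)),
    Image("flickr_3", {"dog"}),
]
updated = propagate_gps(seed, candidates, shares_two_tags)
print(updated)
```

Recording the `gps_inferred` flag anticipates the point made below that inferred metadata should be distinguishable from device-captured metadata.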

Another approach to geolocating imagery is by searching Flickr for images having similar image characteristics (e.g., gist, eigenvectors, color histograms, keypoint descriptors, FFTs, etc.), and assessing geolocation data in the identified images to infer the probable location of the original image. See, e.g., Hays, et al, IM2GPS: Estimating geographic information from a single image, Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008. Techniques detailed in the Hays paper are suited for use in conjunction with the present technology (including use of probability functions as quantizing the uncertainty of inferential techniques).

When geolocation data is captured by the camera, it is highly reliable. Also generally reliable is metadata (location or otherwise) that is authored by the proprietor of the image. However, when metadata descriptors (geolocation or semantic) are inferred or estimated, or authored by a stranger to the image, uncertainty and other issues arise.

Desirably, such intrinsic uncertainty should be memorialized in some fashion so that later users thereof (human or machine) can take this uncertainty into account.

One approach is to segregate uncertain metadata from device-authored or creator-authored metadata. For example, different data structures can be used. Or different tags can be used to distinguish such classes of information. Or each metadata descriptor can have its own sub-metadata, indicating the author, creation date, and source of the data. The author or source field of the sub-metadata may have a data string indicating that the descriptor was inferred, estimated, deduced, etc., or such information may be a separate sub-metadata tag.

Each uncertain descriptor may be given a confidence metric or rank. This data may be determined by the public, either expressly or inferentially. An example is the case when a user sees a Flickr picture she believes to be from Yellowstone, and adds a “Yellowstone” location tag, together with a “95%” confidence tag (her estimation of certainty about the contributed location metadata). She may add an alternate location metatag, indicating “Montana,” together with a corresponding 50% confidence tag. (The confidence tags needn't sum to 100%. Just one tag can be contributed—with a confidence less than 100%. Or several tags can be contributed—possibly overlapping, as in the case with Yellowstone and Montana).
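One possible shape for a descriptor carrying its own sub-metadata is sketched below. The field names, the set of `source` strings, and the `segregate` helper are assumptions chosen for illustration; the specification leaves the representation open.

```python
from dataclasses import dataclass
from typing import Optional

# Source labels treated as uncertain in this sketch (hypothetical vocabulary).
UNCERTAIN_SOURCES = {"inferred", "estimated", "third_party"}


@dataclass
class Descriptor:
    value: str                          # e.g. "Yellowstone"
    kind: str                           # e.g. "location" or "semantic"
    author: str                         # who or what contributed it
    source: str                         # "device", "creator", "inferred", ...
    created: str                        # creation date, kept as a plain string here
    confidence: Optional[float] = None  # contributor's own estimate, 0..1


def segregate(descriptors):
    """Split descriptors into (trusted, uncertain) lists by their source field."""
    trusted = [d for d in descriptors if d.source not in UNCERTAIN_SOURCES]
    uncertain = [d for d in descriptors if d.source in UNCERTAIN_SOURCES]
    return trusted, uncertain


tags = [
    Descriptor("44.4605,-110.8281", "location", "camera", "device", "2008-07-02"),
    Descriptor("Yellowstone", "location", "viewer_17", "third_party", "2008-07-05", 0.95),
    Descriptor("Montana", "location", "viewer_17", "third_party", "2008-07-05", 0.50),
]
trusted, uncertain = segregate(tags)
print([d.value for d in uncertain])
```

Note that the two viewer-contributed confidences (95% and 50%) are stored independently and are not required to sum to 100%.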

If several users contribute metadata of the same type to an image (e.g., location metadata), the combined contributions can be assessed to generate aggregate information. Such information may indicate, for example, that 5 of 6 users who contributed metadata tagged the image as Yellowstone, with an average 93% confidence; that 1 of 6 users tagged the image as Montana, with a 50% confidence, and 2 of 6 users tagged the image as Glacier National park, with a 15% confidence, etc.
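Such aggregation might be computed as in the following sketch. The contribution tuples are invented solely so that the totals reproduce the figures in the example above (individual users may contribute more than one tag, which is why the per-tag counts exceed six in sum); the `aggregate` function and its output layout are assumptions, and each user is assumed to apply a given tag at most once.

```python
from collections import defaultdict


def aggregate(contributions):
    """Summarize (user, tag, confidence) tuples per tag."""
    users = {user for user, _, _ in contributions}
    by_tag = defaultdict(list)
    for _, tag, confidence in contributions:
        by_tag[tag].append(confidence)
    return {
        tag: {
            "users": len(confs),
            "of": len(users),
            "mean_confidence": sum(confs) / len(confs),
        }
        for tag, confs in by_tag.items()
    }


contributions = [
    ("u1", "Yellowstone", 0.95), ("u1", "Montana", 0.50),
    ("u2", "Yellowstone", 0.90),
    ("u3", "Yellowstone", 0.95), ("u3", "Glacier National Park", 0.15),
    ("u4", "Yellowstone", 0.90),
    ("u5", "Yellowstone", 0.95),
    ("u6", "Glacier National Park", 0.15),
]
summary = aggregate(contributions)
for tag, stats in summary.items():
    print(tag, stats["users"], "of", stats["of"], round(stats["mean_confidence"], 2))
```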

Inferential determination of metadata reliability can be performed, either when express estimates made by contributors are not available, or routinely. An example of this is the FIG. 21 photo case, in which metadata occurrence counts are used to judge the relative merit of each item of metadata (e.g., Telephone=19 or 7, depending on the methodology used). Similar methods can be used to rank reliability when several metadata contributors offer descriptors for a given image.

Crowd-sourcing techniques are known to parcel image-identification tasks to online workers, and collect the results. However, prior art arrangements are understood to seek simple, short-term consensus on identification. Better, it seems, is to quantify the diversity of opinion collected about image contents (and optionally its variation over time, and information about the sources relied-on), and use that richer data to enable automated systems to make more nuanced decisions about imagery, its value, its relevance, its use, etc.

To illustrate, known crowd-sourcing image identification techniques may identify the FIG. 35 image with the identifiers “soccer ball” and “dog.” These are the consensus terms from one or several viewers. Disregarded, however, may be information about the long tail of alternative descriptors, e.g., summer, Labrador, football, tongue, afternoon, evening, morning, fescue, etc. Also disregarded may be demographic and other information about the persons (or processes) that served as metadata identifiers, or the circumstances of their assessments. A richer set of metadata may associate with each descriptor a set of sub-metadata detailing this further information.

The sub-metadata may indicate, for example, that the tag “football” was contributed by a 21 year old male in Brazil on Jun. 18, 2008. It may further indicate that the tags “afternoon,” “evening” and “morning” were contributed by an automated image classifier at the University of Texas that made these judgments on Jul. 2, 2008 based, e.g., on the angle of illumination on the subjects. Those three descriptors may also have associated probabilities assigned by the classifier, e.g., 50% for afternoon, 30% for evening, and 20% for morning (each of these percentages may be stored as a sub-metatag). One or more of the metadata terms contributed by the classifier may have a further sub-tag pointing to an on-line glossary that aids in understanding the assigned terms. For example, such a sub-tag may give the URL of a computer resource that associates the term “afternoon” with a definition, or synonyms, indicating that the term means noon to 7 pm. The glossary may further indicate a probability density function, indicating that the mean time meant by “afternoon” is 3:30 pm, the median time is 4:15 pm, and the term has a Gaussian function of meaning spanning the noon to 7 pm time interval.

Expertise of the metadata contributors may also be reflected in sub-metadata. The term “fescue” may have sub-metadata indicating it was contributed by a 45 year old grass seed farmer in Oregon. An automated system can conclude that this metadata term was contributed by a person having unusual expertise in a relevant knowledge domain, and may therefore treat the descriptor as highly reliable (albeit maybe not highly relevant). This reliability determination can be added to the metadata collection, so that other reviewers of the metadata can benefit from the automated system's assessment.

Assessment of the contributor's expertise can also be self-made by the contributor. Or it can be made otherwise, e.g., by reputational rankings using collected third party assessments of the contributor's metadata contributions. (Such reputational rankings are known, e.g., from public assessments of sellers on EBay, and of book reviewers on Amazon.) Assessments may be field-specific, so a person may be judged (or self-judged) to be knowledgeable about grass types, but not about dog breeds. Again, all such information is desirably memorialized in sub-metatags (including sub-sub-metatags, when the information is about a sub-metatag).

More information about crowd-sourcing, including use of contributor expertise, etc., is found in Digimarc's published patent application 20070162761.

Returning to the case of geolocation descriptors (which may be numeric, e.g., latitude/longitude, or textual), an image may accumulate—over time—a lengthy catalog of contributed geographic descriptors. An automated system (e.g., a server at Flickr) may periodically review the contributed geotag information, and distill it to facilitate public use. For numeric information, the process can apply known clustering algorithms to identify clusters of similar coordinates, and average same to generate a mean location for each cluster. For example, a photo of a geyser may be tagged by some people with latitude/longitude coordinates in Yellowstone, and by others with latitude/longitude coordinates of Hells Gate Park in New Zealand. These coordinates thus form two distinct clusters that would be separately averaged. If 70% of the contributors placed the coordinates in Yellowstone, the distilled (averaged) value may be given a confidence of 70%. Outlier data can be maintained, but given a low probability commensurate with its outlier status. Such distillation of the data by a proprietor can be stored in metadata fields that are readable by the public, but not writable.
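The distillation of numeric geotags can be illustrated with a deliberately simple stand-in for the "known clustering algorithms" mentioned above: a greedy pass that joins each point to the first cluster whose centroid lies within a fixed radius. The greedy method, the 100 km radius, the function names, and the sample coordinates (rough values chosen for illustration) are all assumptions. The naive latitude/longitude averaging used here is also only adequate for clusters that do not straddle the antimeridian or the poles.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def distill_geotags(points, radius_km=100.0):
    """Greedily cluster points, then report each cluster's mean and vote share."""
    clusters = []
    for point in points:
        for cluster in clusters:
            if haversine_km(point, centroid(cluster)) <= radius_km:
                cluster.append(point)
                break
        else:
            clusters.append([point])
    results = [
        {"lat": centroid(c)[0], "lon": centroid(c)[1], "confidence": len(c) / len(points)}
        for c in clusters
    ]
    return sorted(results, key=lambda r: r["confidence"], reverse=True)


geotags = [
    (44.46, -110.83), (44.47, -110.84), (44.45, -110.82), (44.52, -110.84),
    (44.46, -110.83), (44.43, -110.59), (44.60, -110.50),   # Yellowstone area
    (-38.06, 176.30), (-38.07, 176.31), (-38.06, 176.29),   # Rotorua area, NZ
]
distilled = distill_geotags(geotags)
for entry in distilled:
    print(round(entry["lat"], 2), round(entry["lon"], 2), entry["confidence"])
```

With seven of ten contributions in the first cluster, the distilled Yellowstone location receives a confidence of 70% and the New Zealand location 30%.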

The same or other approach can be used with added textual metadata—e.g., it can be accumulated and ranked based on frequency of occurrence, to give a sense of relative confidence.

The technology detailed in this specification finds numerous applications in contexts involving watermarking, bar-coding, fingerprinting, OCR-decoding, and other approaches for obtaining information from imagery. Consider again the FIG. 21 cell phone photo of a desk phone. Flickr can be searched based on image metrics to obtain a collection of subject-similar images (e.g., as detailed above). A data extraction process (e.g., watermark decoding, fingerprint calculation, barcode- or OCR-reading) can be applied to some or all of the resulting images, and information gleaned thereby can be added to the metadata for the FIG. 21 image, and/or submitted to a service provider with image data (either for the FIG. 21 image, and/or for related images).

From the collection of images found in the first search, text or GPS metadata can be harvested, and a second search can be conducted for similarly-tagged images. From the text tags Cisco and VOIP, for example, a search of Flickr may find a photo of the underside of the user's phone—with OCR-readable data—as shown in FIG. 36. Again, the extracted information can be added to the metadata for the FIG. 21 image, and/or submitted to a service provider to enhance the response it is able to provide to the user.

As just shown, a cell phone user may be given the ability to look around corners and under objects—by using one image as a portal to a large collection of related images.

User Interface

Referring to FIGS. 44 and 45A, cell phones and related portable devices 110 typically include a display 111 and a keypad 112. In addition to a numeric (or alphanumeric) keypad there is often a multi-function controller 114. One popular controller has a center button 118, and four surrounding buttons 116 a, 116 b, 116 c and 116 d (also shown in FIG. 44).

An illustrative usage model is as follows. A system responds to an image 128 (either optically captured or wirelessly received) by displaying a collection of related images to the user, on the cell phone display. For example, the user captures an image and submits it to a remote service. The service determines image metrics for the submitted image (possibly after pre-processing, as detailed above), and searches (e.g., Flickr) for visually similar images. These images are transmitted to the cell phone (e.g., by the service, or directly from Flickr), and they are buffered for display. The service can prompt the user, e.g., by instructions presented on the display, to repeatedly press the right-arrow button 116 b on the four-way controller (or press-and-hold) to view a sequence of pattern-similar images (130, FIG. 45A). Each time the button is pressed, another one of the buffered apparently-similar images is displayed.

By techniques like those earlier described, or otherwise, the remote service can also search for images that are similar in geolocation to the submitted image. These too can be sent to and buffered at the cell phone. The instructions may advise that the user can press the left-arrow button 116 d of the controller to review these GPS-similar images (132, FIG. 45A).

Similarly, the service can search for images that are similar in metadata to the submitted image (e.g., based on textual metadata inferred from other images, identified by pattern matching or GPS matching). Again, these images can be sent to the phone and buffered for immediate display. The instructions may advise that the user can press the up-arrow button 116 a of the controller to view these metadata-similar images (134, FIG. 45A).

Thus, by pressing the right, left, and up buttons, the user can review images that are similar to the captured image in appearance, location, or metadata descriptors.

Whenever such review reveals a picture of particular interest, the user can press the down button 116 c. This action identifies the currently-viewed picture to the service provider, which can then repeat the process with that picture as the base image, with button presses enabling review of images that are similar to the new base image in appearance (116 b), location (116 d), or metadata (116 a).

This process can continue indefinitely. At some point the user can press the center button 118 of the four-way controller. This action submits the then-displayed image to a service provider for further action (e.g., triggering a corresponding response, as disclosed, e.g., in earlier-cited documents). This action may involve a different service provider than the one that provided all the alternative imagery, or they can be the same. (In the latter case the finally-selected image need not be sent to the service provider, since that service provider knows all the images buffered by the cell phone, and may track which image is currently being displayed.)
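The browsing model just described (three feeds per base image, re-basing on the down button, submission on the center button) can be sketched as a small state holder. The class name `ImageBrowser`, the string button names, and the `fetch_feeds` callback standing in for the remote service are hypothetical; image identifiers are plain strings for simplicity.

```python
class ImageBrowser:
    # Which feed each directional button walks through.
    FEEDS = {"right": "appearance", "left": "location", "up": "metadata"}

    def __init__(self, base, fetch_feeds):
        self.fetch_feeds = fetch_feeds  # callable: base image -> {feed name: [images]}
        self._load(base)

    def _load(self, base):
        """Make `base` the base image and buffer its three feeds."""
        self.base = base
        self.current = base
        self.feeds = self.fetch_feeds(base)
        self.position = {name: 0 for name in self.feeds}

    def press(self, button):
        if button in self.FEEDS:
            name = self.FEEDS[button]
            feed = self.feeds[name]
            if self.position[name] < len(feed):   # stay put when a feed is exhausted
                self.current = feed[self.position[name]]
                self.position[name] += 1
            return self.current
        if button == "down":                      # re-base on the displayed image
            self._load(self.current)
            return self.current
        if button == "center":                    # hand off for further action
            return ("submit", self.current)
        raise ValueError(f"unknown button: {button}")


def fake_service(base):
    # Placeholder for the remote searches; returns made-up identifiers.
    return {
        "appearance": [f"{base}/a1", f"{base}/a2"],
        "location": [f"{base}/g1"],
        "metadata": [f"{base}/m1"],
    }


browser = ImageBrowser("img", fake_service)
steps = [browser.press(b) for b in ("right", "left", "right", "down", "right", "center")]
print(steps)
```

Each feed keeps its own position, so pressing right, then left, then right again resumes the appearance feed where it left off rather than restarting it.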

The dimensions of information browsing just-detailed (similar-appearance images; similar-location images; similar-metadata images) can be different in other embodiments. Consider, for example, an embodiment that takes an image of a house as input (or latitude/longitude), and returns the following sequences of images: (a) the houses for sale nearest in location to the input-imaged house; (b) the houses for sale nearest in price to the input-imaged house; and (c) the houses for sale nearest in features (e.g., bedrooms/baths) to the input-imaged house. (The universe of houses displayed can be constrained, e.g., by zip-code, metropolitan area, school district, or other qualifier.)

Another example of this user interface technique is presentation of search results from EBay for auctions listing Xbox 360 game consoles. One dimension can be price (e.g., pushing button 116 b yields a sequence of screens showing Xbox 360 auctions, starting with the lowest-priced ones); another can be seller's geographical proximity to user (closest to furthest, shown by pushing button 116 d); another can be time until end of auction (shortest to longest, presented by pushing button 116 a). Pressing the middle button 118 can load the full web page of the auction being displayed.

A related example is a system that responds to a user-captured image of a car by identifying the car (using image features and associated database(s)), searching EBay and Craigslist for similar cars, and presenting the results on the screen. Pressing button 116 b presents screens of information about cars offered for sale (e.g., including image, seller location, and price) based on similarity to the input image (same model year/same color first, and then nearest model years/colors), nationwide. Pressing button 116 d yields such a sequence of screens, but limited to the user's state (or metropolitan region, or a 50 mile radius of the user's location, etc). Pressing button 116 a yields such a sequence of screens, again limited geographically, but this time presented in order of ascending price (rather than closest model year/color). Again, pressing the middle button loads the full web page (EBay or Craigslist) of the car last-displayed.

Another embodiment is an application that helps people recall names. A user sees a familiar person at a party, but can't remember his name. Surreptitiously the user snaps a picture of the person, and the image is forwarded to a remote service provider. The service provider extracts facial recognition parameters and searches social networking sites (e.g., FaceBook, MySpace, Linked-In), or a separate database containing facial recognition parameters for images on those sites, for similar-appearing faces. (The service may provide the user's sign-on credentials to the sites, allowing searching of information that is not otherwise publicly accessible.) Names and other information about similar-appearing persons located via the searching are returned to the user's cell phone—to help refresh the user's memory.

Various UI procedures are contemplated. When data is returned from the remote service, the user may push button 116 b to scroll thru matches in order of closest-similarity—regardless of geography. Thumbnails of the matched individuals with associated name and other profile information can be displayed, or just full screen images of the person can be presented—with the name overlaid. When the familiar person is recognized, the user may press button 118 to load the full FaceBook/MySpace/Linked-In page for that person. Alternatively, instead of presenting images with names, just a textual list of names may be presented, e.g., all on a single screen—ordered by similarity of face-match; SMS text messaging can suffice for this last arrangement.

Pushing button 116 d may scroll thru matches in order of closest-similarity, of people who list their residence as within a certain geographical proximity (e.g., same metropolitan area, same state, same campus, etc.) of the user's present location or the user's reference location (e.g., home). Pushing button 116 a may yield a similar display, but limited to persons who are “Friends” of the user within a social network (or who are Friends of Friends, or who are within another specified degree of separation of the user).

A related arrangement is a law enforcement tool in which an officer captures an image of a person and submits same to a database containing facial portrait/eigenvalue information from government driver license records and/or other sources. Pushing button 116 b causes the screen to display a sequence of images/biographical dossiers about persons nationwide having the closest facial matches. Pushing button 116 d causes the screen to display a similar sequence, but limited to persons within the officer's state. Button 116 a yields such a sequence, but limited to persons within the metropolitan area in which the officer is working.

Instead of three dimensions of information browsing (buttons 116 b, 116 d, 116 a, e.g., for similar-appearing images/similarly located images/similar metadata-tagged images), more or fewer dimensions can be employed. FIG. 45B shows browsing screens in just two dimensions. (Pressing the right button yields a first sequence 140 of information screens; pressing the left button yields a different sequence 142 of information screens.)

Instead of two or more distinct buttons, a single UI control can be employed to navigate in the available dimensions of information. A joystick is one such device. Another is a roller wheel (or scroll wheel). Portable device 110 of FIG. 44 has a roller wheel 124 on its side, which can be rolled-up or rolled-down. It can also be pressed-in to make a selection (e.g., akin to buttons 116 c or 118 of the earlier-discussed controller). Similar controls are available on many mice.

In most user interfaces, opposing buttons (e.g., left button 116 d, and right button 116 b) navigate the same dimension of information—just in opposite directions (e.g., forward/reverse). In the particular interface discussed above, it will be recognized that this is not the case (although in other implementations, it may be so). Pressing the right button 116 b, and then pressing the left button 116 d, does not return the system to its original state. Instead, pressing the right button gives, e.g., a first similar-appearing image, and pressing the left button gives the first similarly-located image.

Sometimes it is desirable to navigate through the same sequence of screens, but in reverse of the order just-reviewed. Various interface controls can be employed to do this.

One is a “Reverse” button. The device 110 in FIG. 44 includes a variety of buttons not-yet discussed (e.g., buttons 120 a-120 f, around the periphery of the controller 114). Any of these—if pressed—can serve to reverse the scrolling order. By pressing, e.g., button 120 a, the scrolling (presentation) direction associated with nearby button 116 b can be reversed. So if button 116 b normally presents items in order of increasing cost, activation of button 120 a can cause the function of button 116 b to switch, e.g., to presenting items in order of decreasing cost. If, in reviewing screens resulting from use of button 116 b, the user “overshoots” and wants to reverse direction, she can push button 120 a, and then push button 116 b again. The screen(s) earlier presented would then appear in reverse order—starting from the present screen.

Or, operation of such a button (e.g., 120 a or 120 f) can cause the opposite button 116 d to scroll back thru the screens presented by activation of button 116 b, in reverse order.

A textual or symbolic prompt can be overlaid on the display screen in all these embodiments—informing the user of the dimension of information that is being browsed, and the direction (e.g., browsing by cost: increasing).

In still other arrangements, a single button can perform multiple functions. For example, pressing button 116 b can cause the system to start presenting a sequence of screens, e.g., showing pictures of houses for sale nearest the user's location—presenting each for 800 milliseconds (an interval set by preference data entered by the user). Pressing button 116 b a second time can cause the system to stop the sequence—displaying a static screen of a house for sale. Pressing button 116 b a third time can cause the system to present the sequence in reverse order, starting with the static screen and going backwards thru the screens earlier presented. Repeated operation of buttons 116 a, 116 b, etc., can operate likewise (but control different sequences of information, e.g., houses closest in price, and houses closest in features).
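The start/stop/reverse behavior of a single button can be modeled as a small state machine. In this sketch the class name `SlideshowButton` and the `tick` method are hypothetical; `tick` is meant to be called by whatever timer enforces the display interval (800 milliseconds in the example). The text does not say what a fourth press does, so the sketch assumes it pauses again.

```python
class SlideshowButton:
    # Mode reached from each mode on a button press. The final entry
    # (reverse -> paused) is an assumption; the text stops at the third press.
    NEXT_MODE = {
        "idle": "forward",
        "forward": "paused",
        "paused": "reverse",
        "reverse": "paused",
    }

    def __init__(self, screens):
        self.screens = screens
        self.index = -1          # nothing displayed yet
        self.mode = "idle"

    def press(self):
        self.mode = self.NEXT_MODE[self.mode]
        return self.mode

    def tick(self):
        """Advance one display interval and return the screen now shown."""
        if self.mode == "forward" and self.index < len(self.screens) - 1:
            self.index += 1
        elif self.mode == "reverse" and self.index > 0:
            self.index -= 1
        return self.screens[self.index] if self.index >= 0 else None


show = SlideshowButton(["house1", "house2", "house3", "house4"])
shown = []
show.press()                                   # first press: start the sequence
shown += [show.tick(), show.tick(), show.tick()]
show.press()                                   # second press: hold a static screen
shown.append(show.tick())
show.press()                                   # third press: play in reverse
shown += [show.tick(), show.tick()]
print(shown)
```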

In arrangements in which the presented information stems from a process applied to a base image (e.g., a picture snapped by a user), this base image may be presented throughout the display—e.g., as a thumbnail in a corner of the display. Or a button on the device (e.g., 126 a, or 120 b) can be operated to immediately summon the base image back to the display.

Touch interfaces are gaining in popularity, such as in products available from Apple and Microsoft (detailed, e.g., in Apple's patent publications 20060026535, 20060026536, 20060250377, 20080211766, 20080158169, 20080158172, 20080204426, 20080174570, and Microsoft's patent publications 20060033701, 20070236470 and 20080001924). Such technologies can be employed to enhance and extend the just-reviewed user interface concepts—allowing greater degrees of flexibility and control. Each button press noted above can have a counterpart gesture in the vocabulary of the touch screen system.

For example, different touch-screen gestures can invoke display of the different types of image feeds just reviewed. A brushing gesture to the right, for example, may present a rightward-scrolling series of image frames 130 of imagery having similar visual content (with the initial speed of scrolling dependent on the speed of the user gesture, and with the scrolling speed decelerating—or not—over time). A brushing gesture to the left may present a similar leftward-scrolling display of imagery 132 having similar GPS information. A brushing gesture upward may present an upward-scrolling display of imagery 134 similar in metadata. At any point the user can tap one of the displayed images to make it the base image, with the process repeating.

Other gestures can invoke still other actions. One such action is displaying overhead imagery corresponding to the GPS location associated with a selected image. The imagery can be zoomed in/out with other gestures. The user can select for display photographic imagery, map data, data from different times of day or different dates/seasons, and/or various overlays (topographic, places of interest, and other data, as is known from Google Earth), etc. Icons or other graphics may be presented on the display depending on contents of particular imagery. One such arrangement is detailed in Digimarc's published application 20080300011.

“Curbside” or “street-level” imagery—rather than overhead imagery—can also be displayed.

It will be recognized that certain embodiments of the present technology include a shared general structure. An initial set of data (e.g., an image, or metadata such as descriptors or geocode information, or image metrics such as eigenvalues) is presented. From this, a second set of data (e.g., images, or image metrics, or metadata) is obtained. From that second set of data, a third set of data is compiled (e.g., images with similar image metrics or similar metadata, or image metrics, or metadata). Items from the third set of data can be used as a result of the process, or the process may continue, e.g., by using the third set of data in determining fourth data (e.g., a set of descriptive metadata can be compiled from the images of the third set). This can continue, e.g., determining a fifth set of data from the fourth (e.g., identifying a collection of images that have metadata terms from the fourth data set). A sixth set of data can be obtained from the fifth (e.g., identifying clusters of GPS data with which images in the fifth set are tagged), and so on.

The sets of data can be images, or they can be other forms of data (e.g., image metrics, textual metadata, geolocation data, decoded OCR-, barcode-, watermark-data, etc).

Any data can serve as the seed. The process can start with image data, or with other information, such as image metrics, textual metadata (aka semantic metadata), geolocation information (e.g., GPS coordinates), decoded OCR/barcode/watermark data, etc. From a first type of information (image metrics, semantic metadata, GPS info, decoded info), a first set of information-similar images can be obtained. From that first set, a second, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can be gathered. From that second type of information, a second set of information-similar images can be obtained. From that second set, a third, different type of information (image metrics/semantic metadata/GPS/decoded info, etc.) can be gathered. From that third type of information, a third set of information-similar images can be obtained. Etc.

Thus, while the illustrated embodiments generally start with an image, and then proceed by reference to its image metrics, and so on, entirely different combinations of acts are also possible. The seed can be the payload from a product barcode. This can generate a first collection of images depicting the same barcode. This can lead to a set of common metadata. That can lead to a second collection of images based on that metadata. Image metrics may be computed from this second collection, and the most prevalent metrics can be used to search and identify a third collection of images. The images thus identified can be presented to the user using the arrangements noted above.

In some embodiments, the present technology may be regarded as employing an iterative, recursive process by which information about one set of images (a single image in many initial cases) is used to identify a second set of images, which may be used to identify a third set of images, etc. The function by which each set of images is related to the next relates to a particular class of image information, e.g., image metrics, semantic metadata, GPS, decoded info, etc.
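The alternation between sets of images and classes of information can be shown with a toy in-memory collection. Everything here is hypothetical: the `IMAGES` table, the coarse location "cell" labels, and the three helper functions merely stand in for searches of a real repository. The chain runs from a location seed, to a first set of images, to their most common tags, to a second and broader set of images.

```python
from collections import Counter

# Toy repository: image id -> coarse location cell and textual tags.
IMAGES = {
    "a": {"cell": "paris-7", "tags": {"eiffel", "paris", "tower"}},
    "b": {"cell": "paris-7", "tags": {"eiffel", "paris"}},
    "c": {"cell": None, "tags": {"eiffel", "paris", "night"}},
    "d": {"cell": None, "tags": {"fountain"}},
}


def images_at(cell):
    """Images whose geolocation falls in the given cell."""
    return {name for name, info in IMAGES.items() if info["cell"] == cell}


def top_tags(image_ids, n=2):
    """The n most frequent tags across a set of images."""
    counts = Counter(tag for name in image_ids for tag in IMAGES[name]["tags"])
    return {tag for tag, _ in counts.most_common(n)}


def images_with(tags):
    """Images carrying every one of the given tags."""
    return {name for name, info in IMAGES.items() if tags <= info["tags"]}


first_set = images_at("paris-7")        # seed (location) -> images
shared_tags = top_tags(first_set)       # images -> semantic metadata
second_set = images_with(shared_tags)   # metadata -> more images
print(sorted(first_set), sorted(shared_tags), sorted(second_set))
```

The second set picks up image "c", which has no geolocation of its own and so could not have been found from the seed directly.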

In other contexts, the relation between one set of images and the next is a function not just of one class of information, but two or more. For example, a seed user image may be examined for both image metrics and GPS data. From these two classes of information a collection of images can be determined—images that are similar in both some aspect of visual appearance and location. Other pairings, triplets, etc., of relationships can naturally be employed—in the determination of any of the successive sets of images.

Further Discussion

Some embodiments of the present technology analyze a consumer cell phone picture, and heuristically determine information about the picture's subject. For example, is it a person, place, or thing? From this high level determination, the system can better formulate what type of response might be sought by the consumer—making operation more intuitive.

For example, if the subject of the photo is a person, the consumer might be interested in adding the depicted person as a FaceBook “friend.” Or sending a text message to that person. Or publishing an annotated version of the photo to a web page. Or simply learning who the person is.

If the subject is a place (e.g., Times Square), the consumer might beinterested in the local geography, maps, and nearby attractions.

If the subject is a thing (e.g., the Liberty Bell or a bottle of beer),the consumer may be interested in information about the object (e.g.,its history, others who use it), or in buying or selling the object,etc.

Based on the image type, an illustrative system/service can identify oneor more actions that it expects the consumer will find mostappropriately responsive to the cell phone image. One or all of thesecan be undertaken, and cached on the consumer's cell phone for review.For example, scrolling a thumbwheel on the side of the cell phone maypresent a succession of different screens—each with differentinformation responsive to the image subject. (Or a screen may bepresented that queries the consumer as to which of a few possibleactions is desired.)

In use, the system can monitor which of the available actions is chosenby the consumer. The consumer's usage history can be employed to refinea Bayesian model of the consumer's interests and desires, so that futureresponses can be better customized to the user.
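By way of illustration only, one simple form such a Bayesian model might take is a Dirichlet-multinomial estimate over the available actions. The specification does not prescribe a particular model; the class name, the action labels, and the uniform prior below are assumptions made for this sketch.

```python
class ActionPreferenceModel:
    """Dirichlet-multinomial estimate of which response a user tends to pick.

    Hypothetical sketch: the action names and the uniform prior are
    illustrative assumptions, not taken from the specification.
    """

    def __init__(self, actions, prior=1.0):
        # One pseudo-count per action serves as a uniform prior.
        self.counts = {action: float(prior) for action in actions}

    def record_choice(self, action):
        # Each observed choice adds one count to the chosen action.
        self.counts[action] += 1.0

    def probabilities(self):
        # Posterior mean probability of each action.
        total = sum(self.counts.values())
        return {action: c / total for action, c in self.counts.items()}

    def ranked_actions(self):
        # Most probable action first, e.g., to order the cached screens.
        probs = self.probabilities()
        return sorted(probs, key=probs.get, reverse=True)


model = ActionPreferenceModel(["show_map", "buy_item", "add_friend"])
model.record_choice("show_map")
model.record_choice("show_map")
print(model.ranked_actions())
```

The ranked list could then determine which response screen is presented first.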

These concepts will be clearer by example (aspects of which are depicted, e.g., in FIGS. 46 and 47).

Processing a Set of Sample Images

Assume a tourist snaps a photo of the Prometheus statue at Rockefeller Center in New York using a cell phone or other mobile device. Initially, it is just a bunch of pixels. What to do?

Assume the image is geocoded with location information (e.g., latitude/longitude in XMP- or EXIF-metadata).

From the geocode data, a search of Flickr can be undertaken for a first set of images—taken from the same (or nearby) location. Perhaps there are 5 or 500 images in this first set.

Metadata from this set of images is collected. The metadata can be of various types. One is words/phrases from a title given to an image. Another is information in metatags assigned to the image—usually by the photographer (e.g., naming the photo subject and certain attributes/keywords), but additionally by the capture device (e.g., identifying the camera model, the date/time of the photo, the location, etc). Another is words/phrases in a narrative description of the photo authored by the photographer.

Some metadata terms may be repeated across different images. Descriptors common to two or more images can be identified (clustered), and the most popular terms may be ranked. (Such a listing is shown at “A” in FIG. 46A. Here, and in other metadata listings, only partial results are given for expository convenience.)
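The identification and ranking of shared descriptors can be sketched as follows. This is a minimal illustration that assumes each image's metadata has already been reduced to a list of terms; the function name and sample terms are hypothetical, and simple case-folding stands in for whatever normalization a real system would apply.

```python
from collections import Counter


def rank_shared_terms(images_metadata, min_images=2):
    """Return (term, image_count) pairs for terms found in at least
    `min_images` images, most popular first."""
    counts = Counter()
    for terms in images_metadata:
        # A set ensures a term repeated within one image is counted once.
        counts.update({t.strip().lower() for t in terms})
    shared = [(term, n) for term, n in counts.items() if n >= min_images]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(shared, key=lambda pair: (-pair[1], pair[0]))


sample = [
    ["Rockefeller Center", "Prometheus", "New York"],
    ["New York", "skating rink", "Rockefeller Center"],
    ["Prometheus", "statue", "new york"],
]
print(rank_shared_terms(sample))
```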

From the metadata, and from other analysis, it may be possible to determine which images in the first set are likely person-centric, which are place-centric, and which are thing-centric.

Consider the metadata with which a set of 50 images may be tagged. Some of the terms relate to place. Some relate to persons depicted in the images. Some relate to things.

Place-Centric Processing

Terms that relate to place can be identified using various techniques. One is to use a database with geographical information to look up location descriptors near a given geographical position. Yahoo's GeoPlanet service, for example, returns a hierarchy of descriptors such as “Rockefeller Center,” “10024” (a zip code), “Midtown Manhattan,” “New York,” “Manhattan,” “New York,” and “United States,” when queried with the latitude/longitude of the Rockefeller Center.

The same service can return names of adjoining/sibling neighborhoods/features on request, e.g., “10017,” “10020,” “10036,” “Theater District,” “Carnegie Hall,” “Grand Central Station,” “Museum of American Folk Art,” etc., etc.

Nearby street names can be harvested from a variety of mapping programs, given a set of latitude/longitude coordinates or other location info.

A glossary of nearby place-descriptors can be compiled in such manner. The metadata harvested from the set of Flickr images can then be analyzed, by reference to the glossary, to identify the terms that relate to place (e.g., that match terms in the glossary).

Consideration then turns to use of these place-related metadata in the reference set of images collected from Flickr.

Some images may have no place-related metadata. These images are likely person-centric or thing-centric, rather than place-centric.

Other images may have metadata that is exclusively place-related. These images are likely place-centric, rather than person-centric or thing-centric.

In between are images that have both place-related metadata, and other metadata. Various rules can be devised and utilized to assign the relative relevance of the image to place.

One rule looks at the number of metadata descriptors associated with an image, and determines the fraction that is found in the glossary of place-related terms. This is one metric.

Another looks at where in the metadata the place-related descriptors appear. If they appear in an image title, they are likely more relevant than if they appear at the end of a long narrative description about the photograph. Placement of the place-related metadata is another metric.

Consideration can also be given to the particularity of the place-related descriptor. A descriptor “New York” or “USA” may be less indicative that an image is place-centric than a more particular descriptor, such as “Rockefeller Center” or “Grand Central Station.” This can yield a third metric.

A related, fourth metric considers the frequency of occurrence (or improbability) of a term—either just within the collected metadata, or within a superset of that data. “RCA Building” is more relevant, from this standpoint, than “Rockefeller Center” because it is used much less frequently.

These and other metrics can be combined to assign each image in the set a place score indicating its potential place-centric-ness.
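As a hedged illustration, the first metric (the fraction of descriptors found in the place glossary) and one possible way of combining several metrics can be sketched as below. The function names, the sample glossary, and the default of unit weights and unit exponents (which reduces to a straight sum) are assumptions for this sketch; in practice the weights and exponents would be determined experimentally.

```python
def place_fraction(descriptors, glossary):
    """Percentage (0-100) of an image's descriptors found in the
    place-term glossary. The glossary is assumed to hold lowercase terms."""
    if not descriptors:
        return 0.0
    hits = sum(1 for d in descriptors if d.lower() in glossary)
    return 100.0 * hits / len(descriptors)


def combine_metrics(metrics, weights=None, exponents=None):
    """Combine metrics as the sum of (weight * metric) ** exponent.

    With default unit weights and exponents this is a straight sum.
    """
    n = len(metrics)
    weights = weights or [1.0] * n
    exponents = exponents or [1.0] * n
    return sum((w * m) ** e for m, w, e in zip(metrics, weights, exponents))


# Hypothetical glossary and descriptors for one image.
glossary = {"rockefeller center", "new york", "midtown manhattan"}
tags = ["Rockefeller Center", "Prometheus", "New York", "statue"]

m1 = place_fraction(tags, glossary)
# The other three metric values are placeholders for illustration.
print(m1, combine_metrics([m1, 80.0, 60.0, 30.0]))
```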

The combination can be a straight sum of four factors, each ranging from 0 to 100. More likely, however, some metrics will be weighted more heavily. The following equation employing metrics M1, M2, M3 and M4 can be employed to yield a score S, with the factors A, B, C, D and exponents W, X, Y and Z determined experimentally, or by Bayesian techniques:

S = (A*M1)^W + (B*M2)^X + (C*M3)^Y + (D*M4)^Z

Person-Centric Processing

A different analysis can be employed to estimate the person-centric-ness of each image in the set obtained from Flickr.

As in the example just given, a glossary of relevant terms can be compiled—this time terms associated with people. In contrast to the place name glossary, the person name glossary can be global—rather than associated with a particular locale. (However, different glossaries may be appropriate in different countries.)

Such a glossary can be compiled from various sources, including telephone directories, lists of most popular names, and other reference works where names appear. The list may start, “Aaron, Abigail, Adam, Addison, Adrian, Aidan, Aiden, Alex, Alexa, Alexander, Alexandra, Alexis, Allison, Alyssa, Amelia, Andrea, Andrew, Angel, Angelina, Anna, Anthony, Antonio, Ariana, Arianna, Ashley, Aubrey, Audrey, Austin, Autumn, Ava, Avery . . . ”

First names alone can be considered, or last names can be considered too. (Some names may be a place name or a person name. Searching for adjoining first/last names and/or adjoining place names can help distinguish ambiguous cases. E.g., Elizabeth Smith is a person; Elizabeth N.J. is a place.)

Personal pronouns and the like can also be included in such a glossary (e.g., he, she, him, her, his, our, I, me, myself, we, they, them, mine, their). Nouns identifying people and personal relationships can also be included (e.g., uncle, sister, daughter, gramps, boss, student, employee, wedding, etc.).

Adjectives and adverbs that are usually applied to people may also be included in the person-term glossary (e.g., happy, boring, blonde, etc.), as can the names of objects and attributes that are usually associated with people (e.g., t-shirt, backpack, sunglasses, tanned, etc.). Verbs associated with people can also be employed (e.g., surfing, drinking).

In this last group, as in some others, there are some terms that could also apply to thing-centric images (rather than person-centric). The term “sunglasses” may appear in metadata for an image depicting sunglasses, alone; “happy” may appear in metadata for an image depicting a dog. There are also some cases where a person-term may also be a place-term (e.g., Boring, Oregon). In more sophisticated embodiments, glossary terms can be associated with respective confidence metrics, by which any results based on such terms may be discounted or otherwise acknowledged to have different degrees of uncertainty.

As before, if an image is not associated with any person-related metadata, then the image can be adjudged likely not person-centric. Conversely, if all of the metadata is person-related, the image is likely person-centric.

For other cases, metrics like those reviewed above can be assessed and combined to yield a score indicating the relative person-centric-ness of each image, e.g., based on the number, placement, particularity and/or frequency/improbability of the person-related metadata associated with the image.

While analysis of metadata gives useful information about whether an image is person-centric, other techniques can also be employed—either alternatively, or in conjunction with metadata analysis.

One technique is to analyze the image looking for continuous areas of skin-tone colors. Such areas characterize many person-centric images, but are less frequently found in images of places and things.

A related technique is facial recognition. This science has advanced to the point where even inexpensive point-and-shoot digital cameras can quickly and reliably identify faces within an image frame (e.g., to focus or expose the image based on such subjects).

(Face finding technology is detailed, e.g., in U.S. Pat. No. 5,781,650 (Univ. of Central Florida), U.S. Pat. No. 6,633,655 (Sharp), U.S. Pat. No. 6,597,801 (Hewlett-Packard) and U.S. Pat. No. 6,430,306 (L-1 Corp.), and in Yang et al, Detecting Faces in Images: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 1, Jan. 2002, pp. 34-58, and Zhao, et al, Face Recognition: A Literature Survey, ACM Computing Surveys, 2003, pp. 399-458. Additional papers about facial recognition technologies are noted in a bibliography at the end of the provisional specification to which this application claims priority.)

Facial recognition algorithms can be applied to the set of reference images obtained from Flickr, to identify those that have evident faces, and identify the portions of the images corresponding to the faces.

Of course, many photos have faces depicted incidentally within the image frame. While all images having faces could be identified as person-centric, most embodiments employ further processing to provide a more refined assessment.

One form of further processing is to determine the percentage area of the image frame occupied by the identified face(s). The higher the percentage, the higher the likelihood that the image is person-centric. This is another metric that can be used in determining an image's person-centric score.

Another form of further processing is to look for the existence of (1) one or more faces in the image, together with (2) person-descriptors in the metadata associated with the image. In this case, the facial recognition data can be used as a “plus” factor to increase a person-centric score of an image based on metadata or other analysis. (The “plus” can take various forms. E.g., a score (on a 0-100 scale) can be increased by 10, or increased by 10%. Or increased by half the remaining distance to 100, etc.)
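The three example forms of the “plus” factor just noted can be sketched as follows. The function name and mode labels are hypothetical, and capping the result at 100 is an assumption made to keep scores on the 0-100 scale.

```python
def apply_plus(score, mode="half_remaining"):
    """Boost a 0-100 score using one of three illustrative 'plus' forms."""
    if mode == "add_10":
        boosted = score + 10              # increase by 10 points
    elif mode == "percent_10":
        boosted = score * 1.10            # increase by 10%
    elif mode == "half_remaining":
        boosted = score + (100 - score) / 2   # half the distance to 100
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumed cap so the boosted score stays on the 0-100 scale.
    return min(100.0, float(boosted))


for mode in ("add_10", "percent_10", "half_remaining"):
    print(mode, apply_plus(60, mode))
```

The third form has the property that repeated boosts approach, but never exceed, 100.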

Thus, for example, a photo tagged with “Elizabeth” metadata is more likely a person-centric photo if the facial recognition algorithm finds a face within the image than if no face is found.

(Conversely, the absence of any face in an image can be used as a “plus” factor to increase the confidence that the image subject is of a different type, e.g., a place or a thing. Thus, an image tagged with Elizabeth as metadata, but lacking any face, increases the likelihood that the image relates to a place named Elizabeth, or a thing named Elizabeth—such as a pet.)

Still more confidence in the determination can be assumed if the facial recognition algorithm identifies a face as a female, and the metadata includes a female name. Such an arrangement, of course, requires that the glossary—or other data structure—have data that associates genders with at least some names.

(Still more sophisticated arrangements can be implemented. For example, the age of the depicted person(s) can be estimated using automated techniques (e.g., as detailed in U.S. Pat. No. 5,781,650, to Univ. of Central Florida). Names found in the image metadata can also be processed to estimate the age of the thus-named person(s). This can be done using public domain information about the statistical distribution of a name as a function of age (e.g., from published Social Security Administration data, and web sites that detail most popular names from birth records). Thus, names Mildred and Gertrude may be associated with an age distribution that peaks at age 80, whereas Madison and Alexis may be associated with an age distribution that peaks at age 8. Finding statistically-likely correspondence between metadata name and estimated person age can further increase the person-centric score for an image. Statistically unlikely correspondence can be used to decrease the person-centric score. (Estimated information about the age of a subject in the consumer's image can also be used to tailor the intuited response(s), as may information about the subject's gender.))

Just as detection of a face in an image can be used as a “plus” factor in a score based on metadata, the existence of person-centric metadata can be used as a “plus” factor to increase a person-centric score based on facial recognition data.

Of course, if no face is found in an image, this information can be used to reduce a person-centric score for the image (perhaps down to zero).

Thing-Centric Processing

A thing-centered image is the third type of image that may be found in the set of images obtained from Flickr in the present example. There are various techniques by which a thing-centric score for an image can be determined.

One technique relies on metadata analysis, using principles like those detailed above. A glossary of nouns can be compiled—either from the universe of Flickr metadata or some other corpus (e.g., WordNet), and ranked by frequency of occurrence. Nouns associated with places and persons can be removed from the glossary. The glossary can be used in the manners identified above to conduct analyses of the images' metadata, to yield a score for each.

Another approach uses pattern matching to identify thing-centric images—matching each against a library of known thing-related images.

Still another approach is based on earlier-determined scores for person-centric and place-centric. A thing-centric score may be assigned in inverse relationship to the other two scores (i.e., if an image scores low for being person-centric, and low for being place-centric, then it can be assigned a high score for being thing-centric).

Such techniques may be combined, or used individually. In any event, a score is produced for each image—tending to indicate whether the image is more- or less-likely to be thing-centric.

Further Processing of Sample Set of Images

Data produced by the foregoing techniques can produce three scores for each image in the set, indicating rough confidence/probability/likelihood that the image is (1) person-centric, (2) place-centric, or (3) thing-centric. These scores needn't add to 100% (although they may). Sometimes an image may score high in two or more categories. In such case the image may be regarded as having multiple relevance, e.g., as both depicting a person and a thing.

The set of images downloaded from Flickr may next be segregated into groups, e.g., A, B and C, depending on whether identified as primarily person-centric, place-centric, or thing-centric. However, since some images may have split probabilities (e.g., an image may have some indicia of being place-centric, and some indicia of being person-centric), identifying an image wholly by its high score ignores useful information. Preferable is to calculate a weighted score for the set of images—taking each image's respective scores in the three categories into account.

A sample of images from Flickr—all taken near Rockefeller Center—may suggest that 60% are place-centric, 25% are person-centric, and 15% are thing-centric.
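One way such a weighted, set-level tally might be computed is sketched below. Normalizing each image's three scores so that every image contributes equally is an assumption of this sketch, as are the function name and the sample scores.

```python
def category_shares(image_scores):
    """Aggregate per-image person/place/thing scores into set-level
    percentages, letting split images contribute to several categories."""
    totals = {"person": 0.0, "place": 0.0, "thing": 0.0}
    for scores in image_scores:
        s = sum(scores.get(cat, 0.0) for cat in totals)
        if s == 0:
            continue  # an image with no evidence contributes nothing
        for cat in totals:
            # Each image contributes a total weight of 1, split by its scores.
            totals[cat] += scores.get(cat, 0.0) / s
    grand = sum(totals.values())
    return {cat: (100.0 * t / grand if grand else 0.0)
            for cat, t in totals.items()}


# Hypothetical scores: one clearly place-centric image, one split image.
print(category_shares([
    {"person": 0, "place": 100, "thing": 0},
    {"person": 50, "place": 50, "thing": 0},
]))
```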

This information gives useful insight into the tourist's cell phone image—even without regard to the contents of the image itself (except its geocoding). That is, chances are good that the image is place-centric, with less likelihood it is person-centric, and still less probability it is thing-centric. (This ordering can be used to determine the order of subsequent steps in the process—allowing the system to more quickly give responses that are most likely to be appropriate.)

This type-assessment of the cell phone photo can be used—alone—to help determine an automated action provided to the tourist in response to the image. However, further processing can better assess the image's contents, and thereby allow a more particularly-tailored action to be intuited.

Similarity Assessments and Metadata Weighting

Within the set of co-located images collected from Flickr, images that are place-centric will tend to have a different appearance than images that are person-centric or thing-centric, yet tend to have some similarity within the place-centric group. Place-centric images may be characterized by straight lines (e.g., architectural edges). Or repetitive patterns (windows). Or large areas of uniform texture and similar color near the top of the image (sky).

Images that are person-centric will also tend to have different appearances than the other two classes of image, yet have common attributes within the person-centric class. For example, person-centric images will usually have faces—generally characterized by an ovoid shape with two eyes and a nose, areas of flesh tones, etc.

Although thing-centric images are perhaps the most diverse, images from any given geography may tend to have unifying attributes or features. Photos geocoded at a horse track will depict horses with some frequency; photos geocoded from Independence National Historical Park in Philadelphia will tend to depict the Liberty Bell regularly, etc.

By determining whether the cell phone image is more similar to place-centric, or person-centric, or thing-centric images in the set of Flickr images, more confidence in the subject of the cell phone image can be achieved (and a more accurate response can be intuited and provided to the consumer).

A fixed set of image assessment criteria can be applied to distinguish images in the three categories. However, the detailed embodiment determines such criteria adaptively. In particular, this embodiment examines the set of images and determines which image features/characteristics/metrics most reliably (1) group like-categorized images together (similarity); and (2) distinguish differently-categorized images from each other (difference). Among the attributes that may be measured and checked for similarity/difference behavior within the set of images are dominant color; color diversity; color histogram; dominant texture; texture diversity; texture histogram; edginess; wavelet-domain transform coefficient histograms, and dominant wavelet coefficients; frequency domain transform coefficient histograms and dominant frequency coefficients (which may be calculated in different color channels); eigenvalues; keypoint descriptors; geometric class probabilities; symmetry; percentage of image area identified as facial; image autocorrelation; low-dimensional “gists” of image; etc. (Combinations of such metrics may be more reliable than the characteristics individually.)

One way to determine which metrics are most salient for these purposes is to compute a variety of different image metrics for the reference images. If the results within a category of images for a particular metric are clustered (e.g., if, for place-centric images, the color histogram results are clustered around particular output values), and if images in other categories have few or no output values near that clustered result, then that metric would appear well suited for use as an image assessment criterion. (Clustering is commonly performed using an implementation of a k-means algorithm.)
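As a rough illustration of judging how salient a metric is, the sketch below scores a metric by the ratio of between-category spread to within-category spread. This variance ratio is a simpler stand-in chosen for the sketch, not the k-means clustering mentioned above; the function name and sample values are hypothetical.

```python
from statistics import mean, pvariance


def separation_score(values_by_category):
    """Ratio of between-category variance to mean within-category variance
    for one image metric. Higher values suggest the metric groups
    like-categorized images and separates different categories."""
    groups = [v for v in values_by_category.values() if v]
    centers = [mean(g) for g in groups]
    within = mean(pvariance(g) for g in groups)
    between = pvariance(centers)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


# Hypothetical metric values for reference images in each category.
edginess = {"place": [45, 50, 55], "person": [10, 12, 14], "thing": [20, 22, 18]}
brightness = {"place": [10, 50, 90], "person": [12, 48, 88], "thing": [11, 52, 91]}

print(separation_score(edginess), separation_score(brightness))
```

Here the first metric separates the categories well, while the second overlaps heavily across categories and would be a poor assessment criterion.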

In the set of images from Rockefeller Center, the system may determine that an edginess score of >40 is reliably associated with images that score high as place-centric; a facial area score of >15% is reliably associated with images that score high as person-centric; and a color histogram that has a local peak in the gold tones—together with a frequency content for yellow that peaks at lower image frequencies—is somewhat associated with images that score high as thing-centric.

The analysis techniques found most useful in grouping/distinguishing the different categories of images can then be applied to the user's cell phone image. The results can then be analyzed for proximity—in a distance measure sense (e.g., multi-dimensional space)—with the characterizing features associated with different categories of images. (This is the first time that the cell phone image has been processed in this particular embodiment.)
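The proximity analysis can be sketched, under simplifying assumptions, as a Euclidean distance from the cell phone image's feature vector to a representative feature vector for each category. The two-dimensional features (edginess, facial area percentage) and the centroid values below are hypothetical.

```python
import math


def category_distances(features, centroids):
    """Euclidean distance from an image's feature vector to each
    category's representative (centroid) feature vector."""
    return {cat: math.dist(features, c) for cat, c in centroids.items()}


def nearest_category(features, centroids):
    distances = category_distances(features, centroids)
    return min(distances, key=distances.get)


# Hypothetical centroids: (edginess, facial area %).
centroids = {
    "place": (60.0, 2.0),
    "person": (10.0, 30.0),
    "thing": (25.0, 5.0),
}
print(nearest_category((28.0, 4.0), centroids))
```

A real system would likely scale each feature before measuring distance, so that no single metric dominates.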

Using such techniques, the cell phone image may score a 60 for thing-centric, a 15 for place-centric, and a 0 for person-centric (on a scale of 0-100). This is a second, better set of scores that can be used to classify the cell phone image (the first being the statistical distribution of co-located photos found in Flickr).

The user's cell phone image may next be compared for similarity with individual images in the reference set. Similarity metrics identified earlier can be used, or different measures can be applied. The time or processing devoted to this task can be apportioned across the three different image categories based on the just-determined scores. E.g., the process may spend no time judging similarity with reference images classed as 100% person-centric, but instead concentrate on judging similarity with reference images classed as thing- or place-centric (with more effort—e.g., four times as much effort—being applied to the former than the latter). A similarity score is generated for most of the images in the reference set (excluding those that are assessed as 100% person-centric).
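The apportionment of effort can be illustrated with a simple proportional split. With the example scores of 60, 15 and 0, a proportional split yields the four-to-one ratio between thing-centric and place-centric effort noted above. The function name and the notion of a numeric "budget" are assumptions of this sketch.

```python
def apportion_effort(scores, budget):
    """Split a processing budget across categories in proportion to the
    cell phone image's category scores."""
    total = sum(scores.values())
    if total == 0:
        # No evidence either way: fall back to an even split (an assumption).
        return {cat: budget / len(scores) for cat in scores}
    return {cat: budget * s / total for cat, s in scores.items()}


effort = apportion_effort({"thing": 60, "place": 15, "person": 0}, budget=100)
print(effort)
```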

Consideration then returns to metadata. Metadata from the reference images are again assembled—this time weighted in accordance with each image's respective similarity to the cell phone image. (The weighting can be linear or exponential.) Since metadata from similar images is weighted more than metadata from dissimilar images, the resulting set of metadata is tailored to more likely correspond to the cell phone image.

From the resulting set, the top N (e.g., 3) metadata descriptors may be used. Or descriptors that—on a weighted basis—comprise an aggregate M% of the metadata set.
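A minimal sketch of similarity-weighted metadata, with both selection rules (top N, and an aggregate share of the total weight), follows. Linear weighting is used here; the function names, similarity values and terms are hypothetical.

```python
from collections import defaultdict


def weight_metadata(reference_images):
    """reference_images: iterable of (similarity, terms) pairs.
    Each term accumulates the similarity of every image that carries it."""
    weights = defaultdict(float)
    for similarity, terms in reference_images:
        for term in set(terms):
            weights[term] += similarity  # linear weighting
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))


def top_n(ranked, n=3):
    return ranked[:n]


def top_share(ranked, share=0.5):
    """Shortest prefix of the ranked list whose weights reach `share`
    of the total weight."""
    total = sum(w for _, w in ranked)
    picked, running = [], 0.0
    for term, w in ranked:
        if running >= share * total:
            break
        picked.append((term, w))
        running += w
    return picked


refs = [
    (0.9, ["Prometheus", "Rockefeller Center"]),
    (0.5, ["Rockefeller Center", "skating rink"]),
    (0.1, ["taxi"]),
]
ranked = weight_metadata(refs)
print(top_n(ranked, 2), top_share(ranked, 0.5))
```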

In the example given, the thus-identified metadata may comprise “Rockefeller Center,” “Prometheus,” and “Skating rink,” with respective scores of 19, 12 and 5 (see “B” in FIG. 46B).

With this weighted set of metadata, the system can begin determining what responses may be most appropriate for the consumer. In the exemplary embodiment, however, the system continues by further refining its assessment of the cell phone image. (The system may begin determining appropriate responses while also undertaking the further processing.)

Processing a Second Set of Reference Images

At this point the system is better informed about the cell phone image. Not only is its location known; so is its likely type (thing-centric) and some of its most-probably-relevant metadata. This metadata can be used in obtaining a second set of reference images from Flickr.

In the illustrative embodiment, Flickr is queried for images having the identified metadata. The query can be geographically limited to the cell phone's geolocation, or a broader (or unlimited) geography may be searched. (Or the query may run twice, so that half of the images are co-located with the cell phone image, and the others are remote, etc.)

The search may first look for images that are tagged with all of the identified metadata. In this case, 60 images are found. If more images are desired, Flickr may be searched for the metadata terms in different pairings, or individually. (In these latter cases, the distribution of selected images may be chosen so that the metadata occurrence in the results corresponds to the respective scores of the different metadata terms, i.e., 19/12/5.)

Metadata from this second set of images can be harvested, clustered, and may be ranked (“C” in FIG. 46B). (Noise words (“and, of, or,” etc.) can be eliminated. Words descriptive only of the camera or the type of photography may also be disregarded (e.g., “Nikon,” “D80,” “HDR,” “black and white,” etc.). Month names may also be removed.)

The analysis performed earlier—by which each image in the first set of images was classified as person-centric, place-centric or thing-centric—can be repeated on images in the second set of images. Appropriate image metrics for determining similarity/difference within and between classes of this second image set can be identified (or the earlier measures can be employed). These measures are then applied, as before, to generate refined scores for the user's cell phone image, as being person-centric, place-centric, and thing-centric. By reference to the images of the second set, the cell phone image may score a 65 for thing-centric, 12 for place-centric, and 0 for person-centric. (These scores may be combined with the earlier-determined scores, e.g., by averaging, if desired.)

As before, similarity between the user's cell phone image and each image in the second set can be determined. Metadata from each image can then be weighted in accordance with the corresponding similarity measure. The results can then be combined to yield a set of metadata weighted in accordance with image similarity.

Some of the metadata—often including some highly ranked terms—will be of relatively low value in determining image-appropriate responses for presentation to the consumer. “New York” and “Manhattan” are two examples. Generally more useful will be metadata descriptors that are relatively unusual.

A measure of “unusualness” can be computed by determining the frequency of different metadata terms within a relevant corpus, such as Flickr image tags (globally, or within a geolocated region), or image tags by photographers from whom the respective images were submitted, or words in an encyclopedia, or in Google's index of the web, etc. The terms in the weighted metadata list can be further weighted in accordance with their unusualness (i.e., a second weighting).
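One plausible form for the unusualness measure is a logarithmic inverse frequency, sketched below. The specification does not fix a formula; the log-inverse form, the add-one smoothing, and all corpus counts shown are assumptions for illustration.

```python
import math


def unusualness(term, corpus_counts, corpus_size):
    """Log-inverse-frequency of a term in a reference corpus.
    Add-one smoothing keeps unseen terms finite (an assumption)."""
    return math.log((corpus_size + 1) / (corpus_counts.get(term, 0) + 1))


def reweight(weighted_terms, corpus_counts, corpus_size):
    """Apply the second weighting: similarity weight times unusualness."""
    rescored = {
        term: weight * unusualness(term, corpus_counts, corpus_size)
        for term, weight in weighted_terms.items()
    }
    return sorted(rescored.items(), key=lambda kv: -kv[1])


# Hypothetical similarity weights and corpus tag counts.
weights = {"New York": 25.0, "Prometheus": 12.0}
counts = {"New York": 500_000, "Prometheus": 2_000}
print(reweight(weights, counts, corpus_size=1_000_000))
```

In this example the common term “New York” starts with the larger weight but falls below the rarer “Prometheus” after the second weighting.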

The result of such successive processing may yield the list of metadata shown at “D” in FIG. 46B (each shown with its respective score). This information (optionally in conjunction with a tag indicating the person/place/thing determination) allows responses to the consumer to be well-correlated with the cell phone photo.

It will be recognized that this set of inferred metadata for the user's cell phone photo was compiled entirely by automated processing of other images, obtained from public sources such as Flickr, in conjunction with other public resources (e.g., listings of names, places). The inferred metadata can naturally be associated with the user's image. More importantly for the present application, however, it can help a service provider decide how best to respond to submission of the user's image.

Determining Appropriate Responses for Consumer

Referring to FIG. 50, the system just-described can be viewed as one particular application of an “image juicer” that receives image data from a user, and applies different forms of processing so as to gather, compute, and/or infer information that can be associated with the image.

As the information is discerned, it can be forwarded by a router to different service providers. These providers may be arranged to handle different types of information (e.g., semantic descriptors, image texture data, keypoint descriptors, eigenvalues, color histograms, etc) or different classes of images (e.g., photo of friend, photo of a can of soda, etc). Outputs from these service providers are sent to one or more devices (e.g., the user's cell phone) for presentation or later reference. The present discussion now considers how these service providers decide what responses may be appropriate for a given set of input information.

One approach is to establish a taxonomy of image subjects and corresponding responses. A tree structure can be used, with an image first being classed into one of a few high level groupings (e.g., person/place/thing), and then each group being divided into further subgroups. In use, an image is assessed through different branches of the tree until the limits of available information allow no further progress to be made. Actions associated with the terminal leaf or node of the tree are then taken.

Part of a simple tree structure is shown in FIG. 51. (Each node spawns three branches, but this is for illustration only; more or fewer branches can of course be used.)

If the subject of the image is inferred to be an item of food (e.g., if the image is associated with food-related metadata), three different screens of information can be cached on the user's phone. One starts an online purchase of the depicted item at an online vendor. (The choice of vendor, and payment/shipping details, can be obtained from user profile data.) The second screen shows nutritional information about the product. The third presents a map of the local area—identifying stores that sell the depicted product. The user switches among these responses using a roller wheel 124 on the side of the phone (FIG. 44).

If the subject is inferred to be a photo of a family member or friend, one screen presented to the user gives the option of posting a copy of the photo to the user's Facebook page, annotated with the person(s)'s likely name(s). (Determining the names of persons depicted in a photo can be done by submitting the photo to the user's account at Picasa. Picasa performs facial recognition operations on submitted user images, and correlates facial eigenvectors with individual names provided by the user, thereby compiling a user-specific database of facial recognition information for friends and others depicted in the user's prior images. Picasa's facial recognition is understood to be based on technology detailed in U.S. Pat. No. 6,356,659 to Google. Apple's iPhoto software and Facebook's Photo Finder software include similar facial recognition functionality.) Another screen starts a text message to the individual, with the addressing information having been obtained from the user's address book, indexed by the Picasa-determined identity. The user can pursue any or all of the presented options by switching between the associated screens.

If the subject appears to be a stranger (e.g., not recognized by Picasa), the system will have earlier undertaken an attempted recognition of the person using publicly available facial recognition information. (Such information can be extracted from photos of known persons. VideoSurf is one vendor with a database of facial recognition features for actors and other persons. L-1 Corp. maintains databases of driver's license photos and associated data which may—with appropriate safeguards—be employed for facial recognition purposes.) The screen(s) presented to the user can show reference photos of the persons matched (together with a “match” score), as well as dossiers of associated information compiled from the web and other databases. A further screen gives the user the option of sending a “Friend” invite to the recognized person on MySpace, or another social networking site where the recognized person is found to have a presence. A still further screen details the degree of separation between the user and the recognized person. (E.g., my brother David has a classmate Steve, who has a friend Matt, who has a friend Tom, who is the son of the depicted person.) Such relationships can be determined from association information published on social networking sites.
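The taxonomy-based selection of responses described above can be sketched as a walk down a tree that stops when the available information runs out. The tree contents, node labels, and function name below are illustrative assumptions; only the three food-related responses are drawn from the example given earlier.

```python
# Hypothetical fragment of a response taxonomy. Each node lists the
# actions to take if classification can proceed no further than that node.
TAXONOMY = {
    "actions": ["offer open-ended search"],
    "children": {
        "thing": {
            "actions": ["show information about the object"],
            "children": {
                "food": {
                    "actions": [
                        "start online purchase",
                        "show nutritional information",
                        "map stores selling the product",
                    ],
                    "children": {},
                },
            },
        },
        "person": {
            "actions": ["offer to identify the person"],
            "children": {},
        },
    },
}


def responses_for(path, tree=TAXONOMY):
    """Follow the inferred labels down the tree as far as they match,
    then return the actions of the last node reached."""
    node = tree
    for label in path:
        child = node["children"].get(label)
        if child is None:
            break  # limit of available information reached
        node = child
    return node["actions"]


print(responses_for(["thing", "food"]))
print(responses_for(["thing", "furniture"]))
```

An unrecognized sub-group simply falls back to the actions of its parent node.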

Of course, the responsive options contemplated for the differentsub-groups of image subjects may meet most user desires, but some userswill want something different. Thus, at least one alternative responseto each image may be open-ended—allowing the user to navigate todifferent information, or specify a desired response—making use ofwhatever image/metadata processed information is available.

One such open-ended approach is to submit the twice-weighted metadata noted above (e.g., “D” in FIG. 46B) to a general purpose search engine. Google, per se, is not necessarily best for this function, because current Google searches require that all search terms be found in the results. Better is a search engine that does fuzzy searching, and is responsive to differently-weighted keywords, not all of which need be found. The results can indicate different seeming relevance, depending on which keywords are found, where they are found, etc. (A result including “Prometheus” but lacking “RCA Building” would be ranked more relevant than a result including the latter but lacking the former.)
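The weighted, non-mandatory keyword matching just described can be illustrated with a minimal sketch. The integer weights, the sample result strings, and the function names `relevance` and `rank` are all hypothetical; an actual search engine would also consider where each keyword is found.

```python
def relevance(text, weighted_terms):
    """Sum the weights of the keywords found in the text; no keyword is mandatory."""
    lowered = text.lower()
    return sum(weight for term, weight in weighted_terms.items()
               if term.lower() in lowered)


def rank(results, weighted_terms):
    """Order results from most to least relevant under the given weights."""
    return sorted(results, key=lambda r: relevance(r, weighted_terms), reverse=True)


# Hypothetical weights standing in for the twice-weighted metadata.
WEIGHTS = {"Prometheus": 9, "Rockefeller Center": 7, "RCA Building": 4}

results = [
    "Lobby murals of the RCA Building",
    "Paul Manship's Prometheus statue",
    "Hotels in midtown Manhattan",
]
ranked = rank(results, WEIGHTS)
```

With these assumed weights, the result mentioning “Prometheus” outranks the one mentioning only “RCA Building,” and a result matching no keyword scores zero rather than being excluded.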

The results from such a search can be clustered by other concepts. For example, some of the results may be clustered because they share the theme “art deco.” Others may be clustered because they deal with corporate history of RCA and GE. Others may be clustered because they concern the works of the architect Raymond Hood. Others may be clustered as relating to 20th century American sculpture, or Paul Manship. Other concepts found to produce distinct clusters may include John Rockefeller, The Mitsubishi Group, Columbia University, Radio City Music Hall, The Rainbow Room Restaurant, etc.

Information from these clusters can be presented to the user onsuccessive UI screens, e.g., after the screens on which prescribedinformation/actions are presented. The order of these screens can bedetermined by the sizes of the information clusters, or thekeyword-determined relevance.

Still a further response is to present to the user a Google searchscreen—pre-populated with the twice-weighted metadata as search terms.The user can then delete terms that aren't relevant to his/her interest,and add other terms, so as to quickly execute a web search leading tothe information or action desired by the user.

In some embodiments, the system response may depend on people with whomthe user has a “friend” relationship in a social network, or some otherindicia of trust. For example, if little is known about user Ted, butthere is a rich set of information available about Ted's friend Alice,that rich set of information may be employed in determining how torespond to Ted, in connection with a given content stimulus.

Similarly, if user Ted is a friend of user Alice, and Bob is a friend ofAlice, then information relating to Bob may be used in determining anappropriate response to Ted.

The same principles can be employed even if Ted and Alice are strangers, provided there is another basis for implicit trust. While basic profile similarity is one possible basis, a better one is the sharing of an unusual attribute (or, better, several). Thus, for example, if both Ted and Alice share the traits of being fervent supporters of Dennis Kucinich for president, and being devotees of pickled ginger, then information relating to one might be used in determining an appropriate response to present to the other.

The arrangements just described provide powerful new functionality. However, the “intuiting” of the responses likely desired by the user relies largely on the system designers. They consider the different types of images that may be encountered, and dictate responses (or selections of responses) that they believe will best satisfy the users' likely desires.

In this respect the above-described arrangements are akin to earlyindexes of the web—such as Yahoo! Teams of humans generated taxonomiesof information for which people might search, and then manually locatedweb resources that could satisfy the different search requests.

Eventually the web overwhelmed such manual efforts at organization. Google's founders were among those that recognized that an untapped wealth of information about the web could be obtained from examining links between the pages, and actions of users in navigating these links. Understanding of the system thus came from data within the system, rather than from an external perspective.

In like fashion, manually crafted trees of imageclassifications/responses will probably someday be seen as an earlystage in the development of image-responsive technologies. Eventuallysuch approaches will be eclipsed by arrangements that rely on machineunderstanding derived from the system itself, and its use.

One such technique simply examines which responsive screen(s) areselected by users in particular contexts. As such usage patterns becomeevident, the most popular responses can be moved earlier in the sequenceof screens presented to the user.
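One way such usage-driven reordering might be sketched is shown here. The class name `ResponseRanker`, the screen labels, and the representation of context as a tuple are assumptions made only for illustration.

```python
from collections import Counter


class ResponseRanker:
    """Reorders response screens by how often users picked them in a context."""

    def __init__(self, default_order):
        self.default_order = list(default_order)
        self.counts = {}  # context -> Counter of selected screens

    def record_selection(self, context, screen):
        self.counts.setdefault(context, Counter())[screen] += 1

    def ordered_screens(self, context):
        counts = self.counts.get(context, Counter())
        # sorted() is stable, so unselected screens keep the designer's order.
        return sorted(self.default_order, key=lambda s: -counts[s])


ranker = ResponseRanker(["map", "buy", "share"])
context = ("statue", "New York")  # hypothetical context descriptors
ranker.record_selection(context, "share")
ranker.record_selection(context, "share")
ranker.record_selection(context, "buy")
```

Keying the counts by context lets the same image type yield different screen orders for different user populations, while contexts with no recorded history fall back to the designer-specified order.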

Likewise, if patterns become evident in use of the open-ended searchquery option, such action can become a standard response, and movedhigher in the presentation queue.

The usage patterns can be tailored in various dimensions of context. Males between 40 and 60 years of age, in New York, may demonstrate interest in different responses following capture of a snapshot of a statue by a 20th century sculptor, than females between 13 and 16 years of age in Beijing. Most persons snapping a photo of a food processor in the weeks before Christmas may be interested in finding the cheapest online vendor of the product; most persons snapping a photo of the same object the week following Christmas may be interested in listing the item for sale on EBay or Craigslist. Etc. Desirably, usage patterns are tracked with as many demographic and other descriptors as possible, so as to be most-predictive of user behavior.

More sophisticated techniques can also be applied, drawing from the richsources of expressly- and inferentially-linked data sources nowavailable. These include not only the web and personal profileinformation, but all manner of other digital data we touch and in whichwe leave traces, e.g., cell phone billing statements, credit cardstatements, shopping data from Amazon and EBay, Google search history,browsing history, cached web pages, cookies, email archives, phonemessage archives from Google Voice, travel reservations on Expedia andOrbitz, music collections on iTunes, cable television subscriptions,Netflix movie choices, GPS tracking information, social network data andactivities, activities and postings on photo sites such as Flickr andPicasa and video sites such as YouTube, the times of day memorialized inthese records, etc. (our “digital life log”). Moreover, this informationis potentially available not just for the user, but also for the user'sfriends/family, for others having demographic similarities with theuser, and ultimately everyone else (with appropriate anonymizationand/or privacy safeguards).

The network of interrelationships between these data sources is smaller than the network of web links analyzed by Google, but is perhaps richer in the diversity and types of links. From it can be mined a wealth of inferences and insights, which can help inform what a particular user is likely to want done with a particular snapped image.

Artificial intelligence techniques can be applied to the data-miningtask. One class of such techniques is natural language processing (NLP),a science that has made significant advancements recently.

One example is the Semantic Map compiled by Cognition Technologies,Inc., a database that can be used to analyze words in context, in orderto discern their meaning. This functionality can be used, e.g., toresolve homonym ambiguity in analysis of image metadata (e.g., does“bow” refer to a part of a ship, a ribbon adornment, a performer'sthank-you, or a complement to an arrow? Proximity to terms such as“Carnival cruise,” “satin,” “Carnegie Hall” or “hunting” can provide thelikely answer). U.S. Pat. No. 5,794,050 (FRCD Corp.) details underlyingtechnologies.
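The proximity-based resolution of “bow” can be illustrated with a deliberately simple cue-overlap sketch. This is not the Semantic Map or the method of the cited patent; the `SENSE_CUES` table and the `disambiguate` function are hypothetical stand-ins built only from the example terms given above.

```python
# Each candidate sense of "bow" is paired with context terms that suggest it.
SENSE_CUES = {
    "part of a ship": {"carnival cruise"},
    "ribbon adornment": {"satin"},
    "performer's thank-you": {"carnegie hall"},
    "complement to an arrow": {"hunting"},
}


def disambiguate(nearby_terms, sense_cues=SENSE_CUES):
    """Return the sense whose cue terms overlap most with the nearby metadata."""
    nearby = {term.lower() for term in nearby_terms}
    scores = {sense: len(cues & nearby) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, the ambiguity is left unresolved.
    return best if scores[best] > 0 else None


sense = disambiguate(["bow", "Carnegie Hall", "encore"])
```

A production system would draw on a far larger lexical database and on weighted, rather than binary, evidence.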

The understanding of meaning gained through NLP techniques can also beused to augment image metadata with other relevant descriptors—which canbe used as additional metadata in the embodiments detailed herein. Forexample, a close-up image tagged with the descriptor “hibiscus stamens”can—through NLP techniques—be further tagged with the term “flower.” (Asof this writing, Flickr has 460 images tagged with “hibiscus” and“stamen,” but omitting “flower.”)

U.S. Pat. No. 7,383,169 (Microsoft) details how dictionaries and otherlarge works of language can be processed by NLP techniques to compilelexical knowledge bases that serve as formidable sources of such “commonsense” information about the world. This common sense knowledge can beapplied in the metadata processing detailed herein. (Wikipedia isanother reference source that can serve as the basis for such aknowledge base. Our digital life log is yet another—one that yieldsinsights unique to us as individuals.)

When applied to our digital life log, NLP techniques can reach nuancedunderstandings about our historical interests and actions—informationthat can be used to model (predict) our present interests andforthcoming actions. This understanding can be used to dynamicallydecide what information should be presented, or what action should beundertaken, responsive to a particular user capturing a particular image(or to other stimulus). Truly intuitive computing will then havearrived.

Other Comments

While the image/metadata processing detailed above takes many words todescribe, it need not take much time to perform. Indeed, much of theprocessing of reference data, compilation of glossaries, etc., can bedone off-line—before any input image is presented to the system. Flickr,Yahoo! or other service providers can periodically compile andpre-process reference sets of data for various locales, to be quicklyavailable when needed to respond to an image query.

In some embodiments, other processing activities will be started inparallel with those detailed. For example, if initial processing of thefirst set of reference images suggests that the snapped image isplace-centric, the system can request likely-useful information fromother resources before processing of the user image is finished. Toillustrate, the system may immediately request a street map of thenearby area, together with a satellite view, a street view, a masstransit map, etc. Likewise, a page of information about nearbyrestaurants can be compiled, together with another page detailing nearbymovies and show-times, and a further page with a local weather forecast.These can all be sent to the user's phone and cached for later display(e.g., by scrolling a thumb wheel on the side of the phone).

These actions can likewise be undertaken before any image processingoccurs—simply based on the geocode data accompanying the cell phoneimage.

While geocoding data accompanying the cell phone image was used in the arrangement particularly described, this is not necessary. Other embodiments can select sets of reference images based on other criteria, such as image similarity. (This may be determined by various metrics, as indicated above and also detailed below. Known image classification techniques can also be used to determine one of several classes of images into which the input image falls, so that similarly-classed images can then be retrieved.) Another criterion is the IP address from which the input image is uploaded. Other images uploaded from the same (or geographically-proximate) IP addresses can be sampled to form the reference sets.

Even in the absence of geocode data for the input image, the reference sets of imagery may nonetheless be compiled based on location. Location information for the input image can be inferred from various indirect techniques. A wireless service provider through which a cell phone image is relayed may identify the particular cell tower from which the tourist's transmission was received. (If the transmission originated through another wireless link, such as WiFi, its location may also be known.) The tourist may have used his credit card an hour earlier at a Manhattan hotel, allowing the system (with appropriate privacy safeguards) to infer that the picture was taken somewhere near Manhattan. Sometimes features depicted in an image are so iconic that a quick search for similar images in Flickr can locate the user (e.g., as being at the Eiffel Tower, or at the Statue of Liberty).

GeoPlanet was cited as one source of geographic information. However, a number of other geoinformation databases can alternatively be used. GeoNames-dot-org is one. (It will be recognized that the “-dot-” convention, and omission of the usual http preamble, is used to prevent the reproduction of this text by the Patent Office from being indicated as a live hyperlink.) In addition to providing place names for a given latitude/longitude (at levels of neighborhood, city, state, country), and providing parent, child, and sibling information for geographic divisions, GeoNames' free data (available as a web service) also provides functions such as finding the nearest intersection, finding the nearest post office, finding the surface elevation, etc. Still another option is Google's GeoSearch API, which allows retrieval of and interaction with data from Google Earth and Google Maps.

It will be recognized that archives of aerial imagery are growingexponentially. Part of such imagery is from a straight-down perspective,but off-axis the imagery increasingly becomes oblique. From two or moredifferent oblique views of a location, a 3D model can be created. As theresolution of such imagery increases, sufficiently rich sets of data areavailable that—for some locations—a view of a scene as if taken fromground level may be synthesized. Such views can be matched with streetlevel photos, and metadata from one can augment metadata for the other.

As shown in FIG. 47, the embodiment particularly described above madeuse of various resources, including Flickr, a database of person names,a word frequency database, etc. These are just a few of the manydifferent information sources that might be employed in sucharrangements. Other social networking sites, shopping sites (e.g.,Amazon, EBay), weather and traffic sites, online thesauruses, caches ofrecently-visited web pages, browsing history, cookie collections,Google, other digital repositories (as detailed herein), etc., can allprovide a wealth of additional information that can be applied to theintended tasks. Some of this data reveals information about the user'sinterests, habits and preferences—data that can be used to better inferthe contents of the snapped picture, and to better tailor the intuitedresponse(s).

Likewise, while FIG. 47 shows a few lines interconnecting the differentitems, these are illustrative only. Different interconnections cannaturally be employed.

The arrangements detailed in this specification are a particular few out of myriad that may be employed. Most embodiments will be different than the ones detailed. Some actions will be omitted, some will be performed in different orders, some will be performed in parallel rather than serially (and vice versa), some additional actions may be included, etc.

One additional action is to refine the just-detailed process byreceiving user-related input, e.g., after the processing of the firstset of Flickr images. For example, the system identified “RockefellerCenter,” “Prometheus,” and “Skating rink” as relevant metadata to theuser-snapped image. The system may query the user as to which of theseterms is most relevant (or least relevant) to his/her particularinterest. The further processing (e.g., further search, etc.) can befocused accordingly.

Within an image presented on a touch screen, the user may touch a regionto indicate an object of particular relevance within the image frame.Image analysis and subsequent acts can then focus on the identifiedobject.

Some of the database searches can be iterative/recursive. For example,results from one database search can be combined with the originalsearch inputs and used as inputs for a further search.
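A minimal sketch of such an iterative search follows. The `ASSOCIATIONS` table and `toy_search` function are hypothetical stand-ins for a real database query; only the feedback loop, in which results are merged with the original inputs and searched again, reflects the arrangement described.

```python
# Hypothetical stand-in for a database: each term maps to associated terms.
ASSOCIATIONS = {
    "Prometheus": {"Paul Manship"},
    "Paul Manship": {"sculpture"},
}


def toy_search(terms):
    """Return every term associated with any of the query terms."""
    found = set()
    for term in terms:
        found |= ASSOCIATIONS.get(term, set())
    return found


def iterative_search(search, seed_terms, rounds=2):
    """Feed each round's results back in, together with the original inputs."""
    terms = set(seed_terms)
    for _ in range(rounds):
        new_terms = set(search(terms)) - terms
        if not new_terms:  # nothing new was learned, so stop early
            break
        terms |= new_terms
    return terms


expanded = iterative_search(toy_search, {"Prometheus"})
```

Bounding the number of rounds keeps the expansion from drifting far from the original query.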

It will be recognized that much of the foregoing processing is fuzzy.Much of the data may be in terms of metrics that have no absolutemeaning, but are relevant only to the extent different from othermetrics. Many such different probabilistic factors can be assessed andthen combined—a statistical stew. Artisans will recognize that theparticular implementation suitable for a given situation may be largelyarbitrary. However, through experience and Bayesian techniques, moreinformed manners of weighting and using the different factors can beidentified and eventually used.

If the Flickr archive is large enough, the first set of images in thearrangement detailed above may be selectively chosen to more likely besimilar to the subject image. For example, Flickr can be searched forimages taken at about the same time of day. Lighting conditions will beroughly similar, e.g., so that matching a night scene to a daylightscene is avoided, and shadow/shading conditions might be similar.Likewise, Flickr can be searched for images taken in the sameseason/month. Issues such as seasonal disappearance of the ice skatingrink at Rockefeller Center, and snow on a winter landscape, can thus bemitigated. Similarly, if the camera/phone is equipped with amagnetometer, inertial sensor, or other technology permitting itsbearing (and/or azimuth/elevation) to be determined, then Flickr can besearched for shots with this degree of similarity too.
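The time-of-day and season screening can be sketched as a simple predicate over capture timestamps. The two-hour tolerance, the same-month test, and the function name are assumptions chosen for illustration; they are not values taken from this specification.

```python
from datetime import datetime


def similar_capture_conditions(reference, query, max_hours=2, same_month=True):
    """True if two capture times share roughly the same time of day and month."""
    ref_minutes = reference.hour * 60 + reference.minute
    query_minutes = query.hour * 60 + query.minute
    gap = abs(ref_minutes - query_minutes)
    gap = min(gap, 24 * 60 - gap)  # 23:30 and 00:15 are only 45 minutes apart
    if gap > max_hours * 60:
        return False
    return (not same_month) or reference.month == query.month


query_time = datetime(2009, 12, 20, 0, 15)
night_december = datetime(2008, 12, 5, 23, 30)
noon_december = datetime(2008, 12, 5, 12, 0)
night_july = datetime(2008, 7, 5, 23, 30)
```

The year is ignored deliberately, since lighting and seasonal conditions recur annually.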

Moreover, the sets of reference images collected from Flickr desirablycomprise images from many different sources (photographers)—so theydon't tend towards use of the same metadata descriptors.

Images collected from Flickr may be screened for adequate metadata. For example, images with no metadata (except, perhaps, an arbitrary image number) may be removed from the reference set(s). Likewise, images with fewer than 2 (or 20) metadata terms, or without a narrative description, may be disregarded.
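Such screening might be sketched as follows. The dictionary layout of an image record (`tags`, `description`) is a hypothetical representation, not Flickr's actual data format.

```python
def has_adequate_metadata(image, min_terms=2):
    """Keep an image only if it has enough tags and a narrative description."""
    tags = image.get("tags", [])
    description = image.get("description", "").strip()
    return len(tags) >= min_terms and bool(description)


candidates = [
    {"id": "IMG_0001"},  # arbitrary image number only
    {"id": "a", "tags": ["Prometheus"], "description": "Statue at night"},
    {"id": "b", "tags": ["Prometheus", "skating rink"], "description": ""},
    {"id": "c", "tags": ["Prometheus", "skating rink"],
     "description": "The rink below the gilded statue"},
]
reference_set = [img for img in candidates if has_adequate_metadata(img)]
```

Raising `min_terms` (e.g., to 20) trades a smaller reference set for richer metadata per image.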

Flickr is often mentioned in this specification, but other collectionsof content can of course be used. Images in Flickr commonly havespecified license rights for each image. These include “all rightsreserved,” as well as a variety of Creative Commons licenses, throughwhich the public can make use of the imagery on different terms. Systemsdetailed herein can limit their searches through Flickr for imagerymeeting specified license criteria (e.g., disregard images marked “allrights reserved”).

Other image collections are in some respects preferable. For example,the database at images.google-dot-com seems better at ranking imagesbased on metadata-relevance than Flickr.

Flickr and Google maintain image archives that are publicly accessible.Many other image archives are private. The present technology findsapplication with both—including some hybrid contexts in which bothpublic and proprietary image collections are used (e.g., Flickr is usedto find an image based on a user image, and the Flickr image issubmitted to a private database to find a match and determine acorresponding response for the user).

Similarly, while reference was made to services such as Flickr forproviding data (e.g., images and metadata), other sources can of coursebe used.

One alternative source is an ad hoc peer-to-peer (P2P) network. In onesuch P2P arrangement, there may optionally be a central index, withwhich peers can communicate in searching for desired content, anddetailing the content they have available for sharing. The index mayinclude metadata and metrics for images, together with pointers to thenodes at which the images themselves are stored.

The peers may include cameras, PDAs, and other portable devices, fromwhich image information may be available nearly instantly after it hasbeen captured.

In the course of the methods detailed herein, certain relationships are discovered between imagery (e.g., similar geolocation; similar image metrics; similar metadata, etc.). These data are generally reciprocal, so if the system discovers, during processing of Image A, that its color histogram is similar to that of Image B, then this information can be stored for later use. If a later process involves Image B, the earlier-stored information can be consulted to discover that Image A has a similar histogram, without analyzing Image B. Such relationships are akin to virtual links between the images.
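The reciprocal nature of these stored relationships can be illustrated with a small sketch. The class name `RelationshipStore`, the relationship label, and the similarity value are hypothetical; the point shown is that a link recorded while processing Image A is found again when starting from Image B.

```python
class RelationshipStore:
    """Stores symmetric relationships between images as unordered pairs."""

    def __init__(self):
        self._links = {}  # (frozenset({a, b}), kind) -> value

    def record(self, image_a, image_b, kind, value):
        self._links[(frozenset((image_a, image_b)), kind)] = value

    def lookup(self, image_a, image_b, kind):
        return self._links.get((frozenset((image_a, image_b)), kind))

    def related_to(self, image, kind):
        """List every image linked to the given one by this kind of relationship."""
        return sorted(other
                      for pair, k in self._links
                      if k == kind and image in pair
                      for other in pair - {image})


store = RelationshipStore()
# Discovered while processing Image A (the similarity value is made up).
store.record("Image A", "Image B", "similar_histogram", 0.93)
```

Because the pair is stored as an unordered set, no separate entry is needed for the reverse direction.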

For such relationship information to maintain its utility over time, itis desirable that the images be identified in a persistent manner. If arelationship is discovered while Image A is on a user's PDA, and Image Bis on a desktop somewhere, a means should be provided to identify ImageA even after it has been transferred to the user's MySpace account, andto track Image B after it has been archived to an anonymous computer ina cloud network.

Images can be assigned Digital Object Identifiers (DOI) for this purpose. The International DOI Foundation has implemented the CNRI Handle System so that such resources can be resolved to their current location through the web site at doi-dot-org. Another alternative is for the images to be assigned and digitally watermarked with identifiers tracked by the Digimarc For Images service.

If several different repositories are being searched for imagery orother information, it is often desirable to adapt the query to theparticular databases being used. For example, different facialrecognition databases may use different facial recognition parameters.To search across multiple databases, technologies such as detailed inDigimarc's published patent applications 20040243567 and 20060020630 canbe employed to ensure that each database is probed with anappropriately-tailored query.

Frequent reference has been made to images, but in many cases otherinformation may be used in lieu of image information itself. Indifferent applications image identifiers, characterizing eigenvectors,color histograms, keypoint descriptors, FFTs, associated metadata,decoded barcode or watermark data, etc., may be used instead of imagery,per se (e.g., as a data proxy).

While the earlier example spoke of geocoding by latitude/longitude data,in other arrangements the cell phone/camera may provide location data inone or more other reference systems, such as Yahoo's GeoPlanet ID—theWhere on Earth ID (WOEID).

Location metadata can be used for identifying other resources inaddition to similarly-located imagery. Web pages, for example, can havegeographical associations (e.g., a blog may concern the author'sneighborhood; a restaurant's web page is associated with a particularphysical address). The web service GeoURL-dot-org is a location-to-URLreverse directory that can be used to identify web sites associated withparticular geographies.

GeoURL supports a variety of location tags, including its own ICBM meta tags, as well as Geo Tags. Other systems that support geotagging include RDF, Geo microformat, and the GPSLongitude/GPSLatitude tags commonly used in XMP- and EXIF-camera metainformation. Flickr uses a syntax established by Geobloggers, e.g.

geotagged geo:lat=57.64911 geo:lon=10.40744
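Tags in this form are straightforward to parse. The following is a minimal sketch that handles only the three tags shown in the example; the function name and the requirement that the `geotagged` marker be present are assumptions.

```python
import re

_GEO_TAG = re.compile(r"^geo:(lat|lon)=(-?\d+(?:\.\d+)?)$")


def parse_geotags(tags):
    """Return (latitude, longitude) from geo:lat/geo:lon tags, or None."""
    coords = {}
    for tag in tags:
        match = _GEO_TAG.match(tag)
        if match:
            coords[match.group(1)] = float(match.group(2))
    if "geotagged" in tags and {"lat", "lon"} <= coords.keys():
        return coords["lat"], coords["lon"]
    return None


location = parse_geotags("geotagged geo:lat=57.64911 geo:lon=10.40744".split())
```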

In processing metadata, it is sometimes helpful to clean-up the dataprior to analysis, as referenced above. The metadata may also beexamined for dominant language, and if not English (or other particularlanguage of the implementation), the metadata and the associated imagemay be removed from consideration.

While the earlier-detailed embodiment sought to identify the imagesubject as being one of a person/place/thing so that acorrespondingly-different action can be taken, analysis/identificationof the image within other classes can naturally be employed. A fewexamples of the countless other class/type groupings includeanimal/vegetable/mineral; golf/tennis/football/baseball; male/female;wedding-ring-detected/wedding-ring-not-detected; urban/rural;rainy/clear; day/night; child/adult; summer/autumn/winter/spring;car/truck; consumer product/non-consumer product; can/box/bag;natural/man-made; suitable for all ages/parental advisory for children13 and below/parental advisory for children 17 and below/adult only;etc.

Sometimes different analysis engines may be applied to the user's imagedata. These engines can operate sequentially, or in parallel. Forexample, FIG. 48A shows an arrangement in which—if an image isidentified as person-centric—it is next referred to two other engines.One identifies the person as family, friend or stranger. The otheridentifies the person as child or adult. The latter two engines work inparallel, after the first has completed its work.

Sometimes engines can be employed without any certainty that they areapplicable. For example, FIG. 48B shows engines performingfamily/friend/stranger and child/adult analyses—at the same time theperson/place/thing engine is undertaking its analysis. If the latterengine determines the image is likely a place or thing, the results ofthe first two engines will likely not be used.
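The speculative arrangement of FIG. 48B can be sketched with a thread pool in which all three engines start at once, and the results of the two person-specific engines are consulted only if the first engine reports a person. The three `classify_*` functions are trivial hypothetical stand-ins that read precomputed fields; real engines would perform image analysis.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_subject(image):
    return image["subject"]  # "person", "place" or "thing"


def classify_relationship(image):
    return image.get("relationship", "stranger")  # family/friend/stranger


def classify_age(image):
    return image.get("age_group", "adult")  # child/adult


def analyze(image):
    """Run all engines concurrently; use person-specific results only if needed."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        subject = pool.submit(classify_subject, image)
        relationship = pool.submit(classify_relationship, image)
        age = pool.submit(classify_age, image)
        result = {"subject": subject.result()}
        if result["subject"] == "person":
            result["relationship"] = relationship.result()
            result["age_group"] = age.result()
    return result


portrait = analyze({"subject": "person", "relationship": "friend",
                    "age_group": "child"})
landmark = analyze({"subject": "place"})
```

The cost of this approach is wasted work when the image is a place or thing; the benefit is lower latency when it is a person.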

(Specialized online services can be used for certain types of imagediscrimination/identification. For example, one web site may provide anairplane recognition service: when an image of an aircraft is uploadedto the site, it returns an identification of the plane by make andmodel. (Such technology can follow teachings, e.g., of Sun, The FeaturesVector Research on Target Recognition of Airplane, JCIS-2008Proceedings; and Tien, Using Invariants to Recognize Airplanes inInverse Synthetic Aperture Radar Images, Optical Engineering, Vol. 42,No. 1, 2003.) The arrangements detailed herein can refer imagery thatappears to be of aircraft to such a site, and use the returnedidentification information. Or all input imagery can be referred to sucha site; most of the returned results will be ambiguous and will not beused.)

FIG. 49 shows that different analysis engines may provide their outputsto different response engines. Often the different analysis engines andresponse engines may be operated by different service providers. Theoutputs from these response engines can then be consolidated/coordinatedfor presentation to the consumer. (This consolidation may be performedby the user's cell phone—assembling inputs from different data sources;or such task can be performed by a processor elsewhere.)

One example of the technology detailed herein is a homebuilder who takesa cell phone image of a drill that needs a spare part. The image isanalyzed, the drill is identified by the system as a Black and DeckerDR250B, and the user is provided various info/action options. Theseinclude reviewing photos of drills with similar appearance, reviewingphotos of drills with similar descriptors/features, reviewing the user'smanual for the drill, seeing a parts list for the drill, buying thedrill new from Amazon or used from EBay, listing the builder's drill onEBay, buying parts for the drill, etc. The builder chooses the “buyingparts” option and proceeds to order the necessary part. (FIG. 41.)

Another example is a person shopping for a home. She snaps a photo ofthe house. The system refers the image both to a private database of MLSinformation, and a public database such as Google. The system respondswith a variety of options, including reviewing photos of the nearesthouses offered for sale; reviewing photos of houses listed for sale thatare closest in value to the pictured home, and within the same zip-code;reviewing photos of houses listed for sale that are most similar infeatures to the pictured home, and within the same zip-code;neighborhood and school information, etc. (FIG. 43.)

In another example, a first user snaps an image of Paul Simon at aconcert. The system automatically posts the image to the user's Flickraccount—together with metadata inferred by the procedures detailedabove. (The name of the artist may have been found in a search of Googlefor the user's geolocation; e.g., a Ticketmaster web page revealed thatPaul Simon was playing that venue that night.) The first user's picture,a moment later, is encountered by a system processing a secondconcert-goer's photo of the same event, from a different vantage. Thesecond user is shown the first user's photo as one of the system'sresponses to the second photo. The system may also alert the first userthat another picture of the same event—from a different viewpoint—isavailable for review on his cell phone, if he'll press a certain buttontwice.

In many such arrangements, it will be recognized that “the content isthe network.” Associated with each photo, or each subject depicted in aphoto (or any other item of digital content or information expressedtherein), is a set of data and attributes that serve as implicit- orexpress-links to actions and other content. The user can navigate fromone to the next—navigating between nodes on a network.

Television shows are rated by the number of viewers, and academic papersare judged by the number of later citations. Abstracted to a higherlevel, it will be recognized that such “audience measurement” forphysical- or virtual-content is the census of links that associate itwith other physical- or virtual-content.

While Google is limited to analysis and exploitation of links betweendigital content, the technology detailed herein allows the analysis andexploitation of links between physical content as well (and betweenphysical and electronic content).

Known cell phone cameras and other imaging devices typically have a single “shutter” button. However, the device may be provided with different actuator buttons, each invoking a different operation with the captured image information. By this arrangement, the user can indicate, at the outset, the type of action intended (e.g., identify faces in image per Picasa or VideoSurf information, and post to my Facebook page; or try to identify the depicted person, and send a “friend request” to that person's MySpace account).

Rather than multiple actuator buttons, the function of a sole actuatorbutton can be controlled in accordance with other UI controls on thedevice. For example, repeated pressing of a Function Select button cancause different intended operations to be displayed on the screen of theUI (just as familiar consumer cameras have different photo modes, suchas Close-up, Beach, Nighttime, Portrait, etc.). When the user thenpresses the shutter button, the selected operation is invoked.

One common response (which may need no confirmation) is to post theimage on Flickr or social network site(s). Metadata inferred by theprocesses detailed herein can be saved in conjunction with the imagery(qualified, perhaps, as to its confidence).

In the past, the “click” of a mouse served to trigger a user-desiredaction. That action identified an X-Y-location coordinate on a virtuallandscape (e.g., a desktop screen) that indicated the user's expressintention. Going forward, this role will increasingly be served by the“snap” of a shutter—capturing a real landscape from which a user'sintention will be inferred.

Business rules can dictate a response appropriate to a given situation.These rules and responses may be determined by reference to datacollected by web indexers, such as Google, etc., using intelligentrouting.

Crowdsourcing is not generally suitable for real-time implementations. However, inputs that stymie the system and fail to yield a corresponding action (or yield actions from which the user selects none) can be referred offline for crowdsource analysis, so that the next time such an input is presented, it can be handled better.

Image-based navigation systems present a different topology than is familiar from web page-based navigation systems. FIG. 37A shows that web pages on the internet relate in a point-to-point fashion. For example, web page 1 may link to web pages 2 and 3. Web page 3 may link to page 2. Web page 2 may link to page 4. Etc. FIG. 37B shows the contrasting network associated with image-based navigation. The individual images are linked to a central node (e.g., a router), which then links to further nodes (e.g., response engines) in accordance with the image information.

The “router” here does not simply route an input packet to a destination determined by address information conveyed with the packet—as in the familiar case with internet traffic routers. Rather, the router takes image information and decides what to do with it, e.g., to which responsive system the image information should be referred.

Routers can be stand-alone nodes on a network, or they can be integrated with other devices. (Or their functionality can be distributed between such locations.) A wearable computer may have a router portion (e.g., a set of software instructions)—which takes image information from the computer, and decides how it should be handled. (For example, if it recognizes the image information as being an image of a business card, it may OCR name, phone number, and other data, and enter it into a contacts database.) The particular response for different types of input image information can be determined by a registry database, e.g., of the sort maintained by a computer's operating system, or otherwise.
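The registry-driven routing just described can be illustrated with a minimal sketch. It assumes that some earlier classification step has already labeled the image information with a type; the handler names, the `type` and `name` fields, and the `REGISTRY` table are hypothetical and serve only to show dispatch by content rather than by address.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a recognized image type to a response handler.
Registry = Dict[str, Callable[[dict], str]]


def handle_business_card(info: dict) -> str:
    # Stand-in for OCR followed by an insert into a contacts database.
    return f"contact added: {info.get('name', 'unknown')}"


def handle_default(info: dict) -> str:
    # Used when the registry has no entry for the recognized type.
    return "no specific handler; offer generic options"


REGISTRY: Registry = {"business_card": handle_business_card}


def route(info: dict, registry: Registry = REGISTRY) -> str:
    """Choose a response from the recognized image type, not from an address."""
    handler = registry.get(info.get("type", ""), handle_default)
    return handler(info)
```

A call such as `route({"type": "business_card", "name": "Ann"})` selects the business card handler, while an unregistered type falls through to the default.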

Likewise, while response engines can be stand-alone nodes on a network, they can also be integrated with other devices (or their functions distributed). A wearable computer may have one or several different response engines that take action on information provided by the router portion.

FIG. 52 shows an arrangement employing several computers (A-E), some of which may be wearable computers (e.g., cell phones). The computers include the usual complement of processor, memory, storage, input/output, etc. The storage or memory can contain content, such as images, audio and video. The computers can also include one or more routers and/or response engines. Standalone routers and response engines may also be coupled to the network.

The computers are networked, shown schematically by link 150. This connection can be by any known networking arrangement, including the internet and/or wireless links (WiFi, WiMax, Bluetooth, etc.). Software in at least certain of the computers includes a peer-to-peer (P2P) client, which makes at least some of that computer's resources available to other computers on the network, and reciprocally enables that computer to employ certain resources of the other computers.

Through the P2P client, computer A may obtain image, video and audio content from computer B. Sharing parameters on computer B can be set to determine which content is shared, and with whom. Data on computer B may specify, for example, that some content is to be kept private; some may be shared with known parties (e.g., a tier of social network “Friends”); and other content may be freely shared. (Other information, such as geographic position information, may also be shared—subject to such parameters.)

In addition to setting sharing parameters based on party, the sharing parameters may also specify sharing based on the content age. For example, content/information older than a year might be shared freely, and content older than a month might be shared with a tier of friends (or in accordance with other rule-based restrictions). In other arrangements, fresher content might be the type most liberally shared. E.g., content captured or stored within the past hour, day or week might be shared freely, and content from within the past month or year might be shared with friends.

An exception list can identify content—or one or more classes of content—that is treated differently than the above-detailed rules (e.g., never shared or always shared).
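One way the party-based rules, the age-based rules and the exception list might combine is sketched below. The sketch follows the first age example given (older than a year: shared freely; older than a month: shared with friends) and assumes that newer content stays private and that the exception list is consulted first. The function name, parameters and day counts are illustrative assumptions, not a prescribed policy.

```python
from datetime import timedelta
from typing import FrozenSet, Set


def may_share(age: timedelta, requester: str, friends: Set[str], content_id: str,
              never: FrozenSet[str] = frozenset(),
              always: FrozenSet[str] = frozenset()) -> bool:
    """Decide whether one content item may be shared with a requester."""
    # The exception list overrides the general rules.
    if content_id in never:
        return False
    if content_id in always:
        return True
    # Content older than a year: shared freely.
    if age > timedelta(days=365):
        return True
    # Content older than a month: shared with the tier of friends only.
    if age > timedelta(days=30):
        return requester in friends
    # Assumption: anything newer is kept private.
    return False
```

The alternative arrangement, in which fresher content is shared most liberally, would simply reverse the direction of the two age comparisons.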

In addition to sharing content, the computers can also share their respective router and response engine resources across the network. Thus, for example, if computer A does not have a response engine suitable for a certain type of image information, it can pass the information to computer B for handling by its response engine.

It will be recognized that such a distributed architecture has a number of advantages, in terms of reduced cost and increased reliability. Additionally, the “peer” groupings can be defined geographically, e.g., computers that find themselves within a particular spatial environment (e.g., an area served by a particular WiFi system). The peers can thus establish dynamic, ad hoc subscriptions to content and services from nearby computers. When a computer leaves that environment, its session ends.

Some researchers foresee the day when all of our experiences are captured in digital form. Indeed, Gordon Bell at Microsoft has compiled a digital archive of his recent existence through his technologies CyberAll, SenseCam and MyLifeBits. Included in Bell's archive are recordings of all telephone calls, video of daily life, captures of all TV and radio consumed, archive of all web pages visited, map data of all places visited, polysomnograms for his sleep apnea, etc., etc., etc. (For further information see, e.g., Bell, A Digital Life, Scientific American, March, 2007; Gemmell, MyLifeBits: A Personal Database for Everything, Microsoft Research Technical Report MSR-TR-2006-23; Gemmell, Passive Capture and Ensuing Issues for a Personal Lifetime Store, Proceedings of The First ACM Workshop on Continuous Archival and Retrieval of Personal Experiences (CARPE '04), pp. 48-55; Wilkinson, Remember This, The New Yorker, May 27, 2007. See also the other references cited at Gordon Bell's Microsoft Research web page, and the ACM Special Interest Group web page for CARPE (Capture, Archival & Retrieval of Personal Experiences).)

The present technology is well suited for use with such experiential digital content—either as input to a system (i.e., the system responds to the user's present experience), or as a resource from which metadata, habits, and other attributes can be mined (including service in the role of the Flickr archive in the embodiments earlier detailed).

In embodiments that employ personal experience as an input, it is initially desirable to have the system trigger and respond only when desired by the user—rather than being constantly free-running (which is currently prohibitive from the standpoint of processing, memory and bandwidth issues).

The user's desire can be expressed by a deliberate action by the user, e.g., pushing a button, or making a gesture with head or hand. The system takes data from the current experiential environment, and provides candidate responses.

More interesting, perhaps, are systems that determine the user's interest through biological sensors. Electroencephalography, for example, can be used to generate a signal that triggers the system's response (or triggers one of several different responses, e.g., responsive to different stimuli in the current environment). Skin conductivity, pupil dilation, and other autonomic physiological responses can also be optically or electrically sensed, and provide a triggering signal to the system.

Eye tracking technology can be employed to identify which object in a field of view captured by an experiential-video sensor is of interest to the user. If Tony is sitting in a bar, and his eye falls on a bottle of unusual beer in front of a nearby woman, the system can identify his point of focal attention, and focus its own processing efforts on pixels corresponding to that bottle. With a signal from Tony, such as two quick eye-blinks, the system can launch an effort to provide candidate responses based on that beer bottle—perhaps also informed by other information gleaned from the environment (time of day, date, ambient audio, etc.) as well as Tony's own personal profile data. (Gaze recognition and related technology is disclosed, e.g., in Apple's patent publication 20080211766.)

The system may quickly identify the beer as Doppelbock, e.g., by pattern matching from the image (and/or OCR). With that identifier it finds other resources indicating the beer originates from Bavaria, where it is brewed by monks of St. Francis of Paula. Its 9% alcohol content also is distinctive.

By checking personal experiential archives that friends have made available to Tony, the system learns that his buddy Geoff is fond of Doppelbock, and most recently drank a bottle in a pub in Dublin. Tony's glancing encounter with the bottle is logged in his own experiential archive, where Geoff may later encounter same. The fact of the encounter may also be real-time-relayed to Geoff in Prague, helping populate an on-going data feed about his friends' activities.

The bar may also provide an experiential data server, to which Tony is wirelessly granted access. The server maintains an archive of digital data captured in the bar, and contributed by patrons. The server may also be primed with related metadata & information the management might consider of interest to its patrons, such as the Wikipedia page on the brewing methods of the monks of St. Francis of Paula, what bands might be playing in weeks to come, or what the night's specials are. (Per user preference, some users require that their data be cleared when they leave the bar; others permit the data to be retained.) Tony's system may routinely check the local environment's experiential data server to see what odd bits of information might be found. This time it shows that the woman at barstool 3 (who might employ a range of privacy heuristics to know where and with whom to share her information; in this example she might screen her identity from strangers)—the woman with the Doppelbock—has, among her friends, a Tom <last name encrypted>. Tony's system recognizes that Geoff's circle of friends (which Geoff makes available to his friends) includes the same Tom.

A few seconds after his double-blink, Tony's cell phone vibrates on his belt. Flipping it open and turning the scroll wheel on the side, Tony reviews a series of screens on which the system presents information it has gathered—with the information it deems most useful to Tony shown first.

Equipped with knowledge about this Tony-Geoff-Tom connection (closer than the usual six-degrees-of-separation), and primed with trivia about her Doppelbock beer, Tony picks up his glass and walks down the bar.

(Additional details that can be employed in such arrangements, including user interfaces and visualization techniques, can be found in Dunekacke, “Localized Communication with Mobile Devices,” MobileHCI, 2009.)

While P2P networks such as BitTorrent have permitted sharing of audio, image and video content, arrangements like that shown in FIG. 52 allow networks to share a contextually-richer set of experiential content. A basic tenet of P2P networks is that even in the face of technologies that mine the long-tail of content, the vast majority of users are interested in similar content (the score of tonight's NBA game, the current episode of Lost, etc.), and that given sufficient bandwidth and protocols, the most efficient mechanism to deliver similar content to users is not by sending individual streams, but by piecing the content together based on what your “neighbors” have on the network. This same mechanism can be used to provide metadata related to enhancing an experience, such as being at the bar drinking a Doppelbock, or watching a highlight of tonight's NBA game on a phone while at the bar. The protocol used in the ad-hoc network described above might leverage P2P protocols with the experience server providing a peer registration service (similar to early P2P networks) or in a true P2P modality, with all devices in the ad-hoc network advertising what experiences (metadata, content, social connections, etc.) they have available, either for free, for payment, or for barter of information in-kind, etc. Apple's Bonjour software is well suited for this sort of application.

Within this fabric, Tony's cell phone may simply retrieve the information on Doppelbock by posting the question to the peer network and receive a wealth of information from a variety of devices within the bar or the experience server, without ever knowing the source. Similarly, the experience server may also act as data-recorder, recording the experiences of those within the ad-hoc network, providing a persistence to experience in time and place. Geoff may visit the same bar at some point in the future and see what threads of communication or connections his friend Tony made two weeks earlier, or possibly even leave a note for Tony to retrieve the next time he is at the bar.

The ability to mine the social threads represented by the traffic on the network can also enable the proprietors of the bar to augment the experiences of the patrons by orchestrating interaction or introductions. This may include introducing people with shared interests, singles, etc., or may take the form of gaming, by allowing people to opt in to theme-based games, where patrons piece together clues to find the true identity of someone in the bar or unravel a mystery (similar to the board game Clue). Finally, the demographic information as it relates to audience measurement is of material value to proprietors as they consider which beers to stock next, where to advertise, etc.

Still Further Discussion

Certain portable devices, such as the Apple iPhone, offer single-button access to pre-defined functions. Among these are viewing prices of favorite stocks, viewing a weather forecast, and viewing a general map of the user's location. Additional functions are available, but the user must undertake a series of additional manipulations, e.g., to reach a favorite web site, etc.

An embodiment of the present technology allows these further manipulations to be shortcut by capturing distinctive imagery. Capturing an image of the user's hand may link the user to a babycam back home—delivering real time video of a newborn in a crib. Capturing an image of a wristwatch may load a map showing traffic conditions along some part of a route on the user's drive home, etc. Such functionality is shown in FIGS. 53-55.

A user interface for the portable device includes a set-up/training phase that allows the user to associate different functions with different visual signs. The user is prompted to capture a picture, and enter the URL and name of an action that is to be associated with the depicted object. (The URL is one type of response; others can also be used—such as launching a Java application, etc.)

The system then characterizes the snapped image by deriving a set of feature vectors by which similar images can be recognized (e.g., through pattern/template matching). The feature vectors are stored in a data structure (FIG. 55), in association with the function name and associated URL.

In this initial training phase, the user may capture several images of the same visual sign—perhaps from different distances and perspectives, and with different lighting and backgrounds. The feature extraction algorithm processes the collection to extract a feature set that captures shared similarities of all of the training images.

The extraction of image features, and storage of the data structure, can be performed at the portable device, or at a remote device (or in distributed fashion).

In later operation, the device can check each image captured by the device for correspondence with one of the stored visual signs. If any is recognized, the corresponding action can be launched. Otherwise, the device responds with the other functions available to the user upon capturing a new image.
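The stored association of feature vectors, function name and URL, and the later check of a captured image against it, can be sketched as follows. This is a minimal illustration that assumes each sign is summarized by one fixed-length feature vector and that correspondence is judged by Euclidean distance against a threshold. The `VisualSign` record, the example URLs and the `max_distance` value are hypothetical; the feature extraction itself is outside the sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class VisualSign:
    name: str                   # function name entered by the user
    url: str                    # response associated with the sign
    features: Sequence[float]   # feature vector derived during training


def match_sign(features: Sequence[float], glossary: List[VisualSign],
               max_distance: float = 0.5) -> Optional[VisualSign]:
    """Return the closest stored sign, or None if nothing is close enough."""
    best = min(glossary, key=lambda s: math.dist(s.features, features), default=None)
    if best is not None and math.dist(best.features, features) <= max_distance:
        return best
    return None  # caller falls back to the ordinary image-capture functions


# Hypothetical two-entry glossary.
GLOSSARY = [
    VisualSign("babycam", "http://example.com/babycam", [1.0, 0.0, 0.0]),
    VisualSign("traffic", "http://example.com/traffic", [0.0, 1.0, 0.0]),
]
```

Because the captured image only has to be distinguished from a small set of stored signs, a fairly loose distance threshold may be tolerable in such a scheme.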

In another embodiment, the portable device is equipped with two or more shutter buttons. Manipulation of one button captures an image and executes an action—based on a closest match between the captured image and a stored visual sign. Manipulation of another button captures an image without undertaking such an action.

The device UI can include a control that presents a visual glossary of signs to the user, as shown in FIG. 54. When activated, thumbnails of different visual signs are presented on the device display, in association with names of the functions earlier stored—reminding the user of the defined vocabulary of signs.

The control that launches this glossary of signs can—itself—be an image. One image suitable for this function is a generally featureless frame. An all-dark frame can be achieved by operating the shutter with the lens covered. An all-light frame can be achieved by operating the shutter with the lens pointing at a light source. Another substantially featureless frame (of intermediate density) may be achieved by imaging a patch of skin, or wall, or sky. (To be substantially featureless, the frame should be closer to featureless than matching one of the other stored visual signs. In other embodiments, “featureless” can be concluded if the image has a texture metric below a threshold value.)

(The concept of triggering an operation by capturing an all-light frame can be extended to any device function. In some embodiments, repeated all-light exposures alternately toggle the function on and off. Likewise with all-dark and intermediate density frames. A threshold can be set—by the user with a UI control, or by the manufacturer—to establish how “light” or “dark” such a frame must be in order to be interpreted as a command. For example, 8-bit (0-255) pixel values from a million-pixel sensor can be summed. If the sum is less than 900,000, the frame may be regarded as all-dark. If greater than 254 million, the frame may be regarded as all-light. Etc.)
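The example thresholds can be put in per-pixel terms. A million 8-bit pixels sum to at most 255 × 1,000,000 = 255,000,000, so a sum below 900,000 corresponds to a mean pixel value below 0.9, and a sum above 254,000,000 to a mean above 254. The sketch below applies the same test to a frame of any size by scaling the two sums with integer arithmetic; the function and parameter names are illustrative assumptions.

```python
from typing import Iterable


def classify_frame(pixels: Iterable[int],
                   dark_sum_per_mpx: int = 900_000,
                   light_sum_per_mpx: int = 254_000_000) -> str:
    """Classify a frame of 8-bit pixel values as all-dark, all-light or other.

    The thresholds are the sums quoted for a million-pixel sensor; they are
    scaled here to the actual pixel count.
    """
    values = list(pixels)
    n = len(values)
    if n == 0:
        raise ValueError("frame has no pixels")
    total = sum(values)
    # Compare total/n against threshold/1,000,000 without floating point.
    if total * 1_000_000 < dark_sum_per_mpx * n:
        return "all-dark"
    if total * 1_000_000 > light_sum_per_mpx * n:
        return "all-light"
    return "other"
```

A user-adjustable threshold, as contemplated above, would simply change the two default arguments.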

One of the other featureless frames can trigger another special response. It can cause the portable device to launch all of the stored functions/URLs (or, e.g., a certain five or ten) in the glossary. The device can cache the resulting frames of information, and present them successively when the user operates one of the phone controls, such as button 116 b or scroll wheel 124 in FIG. 44, or makes a certain gesture on a touch screen. (This function can be invoked by other controls as well.)

The third of the featureless frames (i.e., dark, white, or mid-density) can send the device's location to a map server, which can then transmit back multiple map views of the user's location. These views may include aerial views and street map views at different zoom levels, together with nearby street-level imagery. Each of these frames can be cached at the device, and quickly reviewed by turning a scroll wheel or other UI control.

The user interface desirably includes controls for deleting visual signs, and editing the name/functionality assigned to each. The URLs can be defined by typing on a keypad, or by navigating otherwise to a desired destination and then saving that destination as the response corresponding to a particular image.

Training of the pattern recognition engine can continue through use, with successive images of the different visual signs each serving to refine the template model by which that visual sign is defined.

It will be recognized that a great variety of different visual signs can be defined, using resources that are commonly available to the user. A hand can define many different signs, with fingers arranged in different positions (fist, one through five fingers, thumb-forefinger OK sign, open palm, thumbs-up, American Sign Language signs, etc.). Apparel and its components (e.g., shoes, buttons) can also be used, as can jewelry. Features from common surroundings (e.g., a telephone) may also be used.

In addition to launching particular favorite operations, such techniques can be used as a user interface technique in other situations. For example, a software program or web service may present a list of options to the user. Rather than manipulating a keyboard to enter, e.g., choice #3, the user may capture an image of three fingers—visually symbolizing the selection. Software recognizes the three finger symbol as meaning the digit 3, and inputs that value to the process.

If desired, visual signs can form part of authentication procedures, e.g., to access a bank or social-networking web site. For example, after entering a sign-on name or password at a site, the user may be shown a stored image (to confirm that the site is authentic) and then be prompted to submit an image of a particular visual type (earlier defined by the user, but not now specifically prompted by the site). The web site checks features extracted from the just-captured image for correspondence with an expected response, before permitting the user to access the web site.

Other embodiments can respond to a sequence of snapshots within a certain period (e.g., 10 seconds)—a grammar of imagery. An image sequence of “wristwatch,” “four fingers,” “three fingers” can set an alarm clock function on the portable device to chime at 7 am.
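A sequence-based response of this kind can be sketched as a lookup keyed on the ordered sign names, accepted only if the whole sequence falls inside the time window. The text does not say how the example sequence is interpreted as 7 am, so the sketch simply stores that outcome in a table; the `GRAMMAR` table, the action string and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical grammar: ordered sign names mapped to an action description.
GRAMMAR: Dict[Tuple[str, ...], str] = {
    ("wristwatch", "four fingers", "three fingers"): "set alarm for 7 am",
}


def parse_sequence(events: List[Tuple[float, str]],
                   window_seconds: float = 10.0,
                   grammar: Dict[Tuple[str, ...], str] = GRAMMAR) -> Optional[str]:
    """Map a time-ordered list of (timestamp, sign name) pairs to an action.

    Returns None if the list is empty, spans more than the window, or does
    not match any entry in the grammar.
    """
    if not events:
        return None
    if events[-1][0] - events[0][0] > window_seconds:
        return None
    return grammar.get(tuple(name for _, name in events))
```

Each sign name would come from the same recognition step used for single images; only the grouping by time is new.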

In still other embodiments, the visual signs may be gestures that include motion—captured as a sequence of frames (e.g., video) by the portable device.

Context data (e.g., indicating the user's geographic location, time of day, month, etc.) can also be used to tailor the response. For example, when a user is at work, the response to a certain visual sign may be to fetch an image from a security camera from the user's home. At home, the response to the same sign may be to fetch an image from a security camera at work.

In this embodiment, as in others, the response needn't be visual. Audio or other output (e.g., tactile, smell, etc.) can of course be employed.

The just-described technology allows a user to define a glossary of visual signs and corresponding customized responses. An intended response can be quickly invoked by imaging a readily-available subject. The captured image can be of low quality (e.g., overexposed, blurry), since it only needs to be classified among, and distinguished from, a relatively small universe of alternatives.

Visual Intelligence Pre-Processing

Another aspect of the present technology is to perform one or more visual intelligence pre-processing operations on image information captured by a camera sensor. These operations may be performed without user request, and before other image processing operations that the camera customarily performs.

FIG. 56 is a simplified diagram showing certain of the processing performed in an exemplary camera, such as a cell phone camera. Light impinges on an image sensor comprising an array of photodiodes. (CCD or CMOS sensor technologies are commonly used.) The resulting analog electrical signals are amplified, and converted to digital form by A/D converters. The outputs of these A/D converters provide image data in its most raw, or “native,” form.

The foregoing operations are typically performed by circuitry formed on a common substrate, i.e., “on-chip.” Before other processes can access the image data, one or more other processes are commonly performed.

One such further operation is Bayer interpolation (de-mosaicing). The photodiodes of the sensor array each typically capture only a single color of light: red, green or blue (R/G/B), due to a color filter array. This array is composed of a tiled 2×2 pattern of filter elements: one red, a diagonally-opposite one blue, and the other two green. Bayer interpolation effectively “fills in the blanks” of the sensor's resulting R/G/B mosaic pattern, e.g., providing a red signal where there is a blue filter, etc.
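The tiled filter pattern and the “fill in the blanks” step can be illustrated with a small sketch. It assumes the 2×2 tile is laid out with red at even row and even column and blue at odd row and odd column (one of several possible phases), and it estimates a missing green value by averaging the adjacent green samples, which is only the simplest of many interpolation schemes. Function names are illustrative.

```python
from typing import List


def bayer_color(row: int, col: int) -> str:
    """Filter color at a photosite, assuming an R-G / G-B tile layout."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def interpolate_green(mosaic: List[List[float]], row: int, col: int) -> float:
    """Estimate green at a red or blue site from its up/down/left/right neighbors.

    In the assumed layout, every such neighbor of a red or blue site lies
    under a green filter. Neighbors outside the frame are skipped.
    """
    height, width = len(mosaic), len(mosaic[0])
    neighbors = [
        mosaic[r][c]
        for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))
        if 0 <= r < height and 0 <= c < width
    ]
    return sum(neighbors) / len(neighbors)
```

Red and blue values at the remaining sites would be filled in the same spirit, using the nearest samples of the corresponding color.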

Another common operation is white balance correction. This process adjusts the intensities of the component R/G/B colors in order to render certain colors (especially neutral colors) correctly.

Other operations that may be performed include gamma correction and edge enhancement.

Finally, the processed image data is typically compressed to reduce storage requirements. JPEG compression is most commonly used.

The processed, compressed image data is then stored in a buffer memory. Only at this point is the image information commonly available to other processes and services of the cell phone (e.g., by calling a system API).

One such process that is commonly invoked with this processed image data is to present the image to the user on the screen of the camera. The user can then assess the image and decide, e.g., whether (1) to save it to the camera's memory card, (2) to transmit it in a picture message, (3) to delete it, etc.

Until the user instructs the camera (e.g., through a control in a graphical or button-based user interface), the image stays in the buffer memory. Without further instruction, the only use made of the processed image data is to display same on the screen of the cell phone.

FIG. 57 shows an exemplary embodiment of the presently-discussed aspect of the technology. After converting the analog signals into digital native form, one or more other processes are performed.

One such process is to perform a Fourier transformation (e.g., an FFT) on the native image data. This converts the spatial-domain representation of the image into a frequency-domain representation.

A Fourier-domain representation of the native image data can be useful in various ways. One is to screen the image for likely barcode data.

One familiar 2D barcode is a checkerboard-like array of light- and dark-squares. The size of the component squares, and thus their repetition spacing, gives a pair of notable peaks in the Fourier-domain representation of the image at a corresponding frequency. (The peaks may be phase-spaced ninety degrees in the UV plane, if the pattern recurs in equal frequency in both the vertical and horizontal directions.) These peaks extend significantly above other image components at nearby image frequencies—with the peaks often having a magnitude twice- to five- or ten-times (or more) that of nearby image frequencies. If the Fourier transformation is done on tiled patches from the image (e.g., patches of 16×16 pixels, or 128×128 pixels, etc.), it may be found that certain patches that are wholly within a barcode portion of the image frame have essentially no signal energy except at this characteristic frequency.
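The concentration of a patch's energy at one characteristic frequency can be demonstrated with a small, dependency-free sketch. It uses a naive separable discrete Fourier transform rather than an FFT (adequate for a 16×16 patch), ignores the DC term, and reports the strongest frequency together with the fraction of non-DC energy found at that frequency and its mirror images. The ideal checkerboard generator, the function names and the energy-fraction measure are assumptions made for illustration, not the screening template itself.

```python
import cmath
from typing import List, Tuple


def checkerboard(n: int, square: int) -> List[List[int]]:
    """Ideal n x n checkerboard patch with squares of the given pixel size."""
    return [[(x // square + y // square) % 2 for x in range(n)] for y in range(n)]


def dft2_magnitudes(patch: List[List[float]]) -> List[List[float]]:
    """Magnitudes of the 2D DFT of a square patch, indexed as [v][u]."""
    n = len(patch)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    # Transform along each row, then along each column.
    rows = [[sum(patch[y][x] * w[(u * x) % n] for x in range(n)) for u in range(n)]
            for y in range(n)]
    return [[abs(sum(rows[y][u] * w[(v * y) % n] for y in range(n))) for u in range(n)]
            for v in range(n)]


def dominant_frequency(patch: List[List[float]]) -> Tuple[int, int, float]:
    """Return (fu, fv, fraction of non-DC energy at that frequency pair)."""
    n = len(patch)
    mags = dft2_magnitudes(patch)
    dc_energy = mags[0][0] ** 2
    mags[0][0] = 0.0  # ignore overall brightness
    energy = [[m * m for m in row] for row in mags]
    total = sum(sum(row) for row in energy)
    if total <= 1e-12 * max(dc_energy, 1.0):
        return 0, 0, 0.0  # effectively featureless patch

    def fold(k: int) -> int:
        # Bins k and n - k describe the same spatial frequency.
        return min(k, n - k)

    v, u = max(((y, x) for y in range(n) for x in range(n)),
               key=lambda p: energy[p[0]][p[1]])
    fu, fv = fold(u), fold(v)
    peak = sum(energy[y][x] for y in range(n) for x in range(n)
               if fold(x) == fu and fold(y) == fv)
    return fu, fv, peak / total
```

For a 16×16 patch of 2-pixel squares the pattern repeats every 4 pixels, so the peak falls at 16/4 = 4 cycles per patch in both directions and carries essentially all of the non-DC energy. A real captured barcode would show a less extreme ratio.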

As shown in FIG. 57, Fourier transform information can be analyzed for telltale signs associated with an image of a barcode. A template-like approach can be used. The template can comprise a set of parameters against which the Fourier transform information is tested—to see if the data has indicia associated with a barcode-like pattern.

If the Fourier data is consistent with an image depicting a 2D barcode, corresponding information can be routed for further processing (e.g., sent from the cell phone to a barcode-responsive service). This information can comprise the native image data, and/or the Fourier transform information derived from the image data.

In the former case, the full image data needn't be sent. In some embodiments a down-sampled version of the image data, e.g., one-fourth the resolution in both the horizontal and vertical directions, can be sent. Or just patches of the image data having the highest likelihood of depicting part of a barcode pattern can be sent. Or, conversely, patches of the image data having the lowest likelihood of depicting a barcode can be withheld. (These may be patches having no peak at the characteristic frequency, or having a lower amplitude there than nearby.)

The transmission can be prompted by the user. For example, the camera UI may ask the user if information should be directed for barcode processing. In other arrangements, the transmission is dispatched immediately upon a determination that the image frame matches the template, indicating possible barcode data. No user action is involved.

The Fourier transform data can be tested for signs of other image subjects as well. A 1D barcode, for example, is characterized by a significant amplitude component at a high frequency (going “across the pickets”), and another significant amplitude spike at a low frequency (going along the pickets). (Significant again means two-or-more times the amplitude of nearby frequencies, as noted above.) Other image contents can also be characterized by reference to their Fourier domain representation, and corresponding templates can be devised. Fourier transform data is also commonly used in computing fingerprints used for automated recognition of media content.

The Fourier-Mellin (F-M) transform is also useful in characterizing various image subjects/components—including the barcodes noted above. The F-M transform has the advantage of being robust to scale and rotation of the image subject (scale/rotation invariance). In an exemplary embodiment, if the scale of the subject increases (as by moving the camera closer), the F-M transform pattern shifts up; if the scale decreases, the F-M pattern shifts down. Similarly, if the subject is rotated clockwise, the F-M pattern shifts right; if rotated counter-clockwise, the F-M pattern shifts left. (The particular directions of the shifts can be tailored depending on the implementation.) These attributes make F-M data important in recognizing patterns that may be affine-transformed, such as in facial recognition, character recognition, object recognition, etc.
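The reason scale and rotation appear as simple shifts can be seen from the log-polar coordinates commonly used when computing an F-M representation. Scaling a point by a factor a adds ln(a) to its log-radius and leaves its angle alone, while rotating it adds the rotation angle and leaves its log-radius alone. The sketch below checks this for a single point; it illustrates the coordinate property only, is not an implementation of the transform, and its function name is an assumption.

```python
import math
from typing import Tuple


def log_polar(x: float, y: float) -> Tuple[float, float]:
    """Map a Cartesian point (not the origin) to (log-radius, angle in radians)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


# A point, the same point scaled by 2, and the same point rotated by 0.5 radian.
rho, theta = log_polar(3.0, 4.0)
rho_scaled, theta_scaled = log_polar(6.0, 8.0)
angle = 0.5
rho_rot, theta_rot = log_polar(3.0 * math.cos(angle) - 4.0 * math.sin(angle),
                               3.0 * math.sin(angle) + 4.0 * math.cos(angle))

scale_shift = rho_scaled - rho        # expected: ln(2), with the angle unchanged
rotation_shift = theta_rot - theta    # expected: 0.5, with the log-radius unchanged
```

Which axis of the pattern is labeled up/down versus left/right is an implementation choice, as the text notes.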

The arrangement shown in FIG. 57 applies a Mellin transform to the output of the Fourier transform process, to yield F-M data. The F-M data can then be screened for attributes associated with different image subjects.

For example, text is characterized by plural symbols of approximately similar size, composed of strokes in a foreground color that contrast with a larger background field. Vertical edges tend to dominate (albeit slightly inclined with italics), with significant energy also being found in the horizontal direction. Spacings between strokes usually fall within a fairly narrow range.

These attributes manifest themselves as characteristics that tend to reliably fall within certain boundaries in the F-M transform space. Again, a template can define tests by which the F-M data is screened to indicate the likely presence of text in the captured native image data. If the image is determined to include likely-text, it can be dispatched to a service that handles this type of data (e.g., an optical character recognition, or OCR, engine). Again, the image (or a variant of the image) can be sent, or the transform data can be sent, or some other data.

Just as text manifests itself with a certain set of characteristic attributes in the F-M data, so do faces. The F-M data output from the Mellin transform can be tested against a different template to determine the likely presence of a face within the captured image.

Likewise, the F-M data can be examined for tell-tale signs that the image data conveys a watermark. A watermark orientation signal is a distinctive signal present in some watermarks that can serve as a sign that a watermark is present.

In the examples just given, as in others, the templates may be compiled by testing with known images (e.g., “training”). By capturing images of many different text presentations, the resulting transform data can be examined for attributes that are consistent across the sample set, or (more likely) that fall within bounded ranges. These attributes can then be used as the template by which images containing likely-text are identified. (Likewise for faces, barcodes, and other types of image subjects.)
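Compiling a template from bounded ranges can be sketched as follows, assuming each training image has already been reduced to a few named numeric attributes of its transform data. The attribute names in the usage note (`stroke_spacing`, `vertical_energy`), the optional `margin` and the function names are hypothetical; the point is only that the template records, per attribute, the range observed across the sample set and that a new image passes if every attribute falls inside its range.

```python
from typing import Dict, List, Tuple

Template = Dict[str, Tuple[float, float]]


def build_template(samples: List[Dict[str, float]], margin: float = 0.0) -> Template:
    """Record the observed (low, high) range of each attribute across samples."""
    keys = samples[0].keys()
    return {
        k: (min(s[k] for s in samples) - margin, max(s[k] for s in samples) + margin)
        for k in keys
    }


def matches(template: Template, attributes: Dict[str, float]) -> bool:
    """True if every templated attribute is present and within its range."""
    # A missing attribute becomes NaN, which fails both comparisons.
    return all(
        low <= attributes.get(k, float("nan")) <= high
        for k, (low, high) in template.items()
    )
```

For instance, a template built from samples with `stroke_spacing` values of 3.0 and 4.0 would accept a new image measuring 3.5 and reject one measuring 9.0. Separate templates would be compiled the same way for faces, barcodes and other subjects.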

FIG. 57 shows that a variety of different transforms can be applied to the image data. These are generally shown as being performed in parallel, although one or more can be performed sequentially—either all operating on the same input image data, or one transform using an output of a previous transform (as is the case with the Mellin transform). Although not all shown (for clarity of illustration), outputs from each of the other transform processes can be examined for characteristics that suggest the presence of a certain image type. If found, related data is then sent to a service appropriate to that type of image information.

In addition to Fourier transform and Mellin transform processes, processes such as eigenface (eigenvector) calculation, image compression, cropping, affine distortion, filtering, DCT transform, wavelet transform, Gabor transform, and other signal processing operations can be applied (all are regarded as transforms). Others are noted elsewhere in this specification, and in the documents incorporated by reference. Outputs from these processes are then tested for characteristics indicating that the chance that the image depicts a certain class of information is greater than random chance.

The outputs from some processes may be input to other processes. For example, an output from one of the boxes labeled ETC in FIG. 57 is provided as an input to the Fourier transform process. This ETC box can be, for example, a filtering operation. Sample filtering operations include median, Laplacian, Wiener, Sobel, high-pass, low-pass, bandpass, Gabor, signum, etc. (Digimarc's U.S. Pat. Nos. 6,442,284, 6,483,927, 6,516,079, 6,614,914, 6,631,198, 6,724,914, 6,988,202, 7,013,021 and 7,076,082 show various such filters.)

Sometimes a single service may handle different data types, or data that passes different screens. In FIG. 57, for example, a facial recognition service may receive F-M transform data, or eigenface data. Or it may receive image information that has passed one of several different screens (e.g., its F-M transform passed one screen, or its eigenface representation passed a different screen).

In some cases, data can be sent to two or more different services.

Although not essential, it is desirable that some or all of the processing shown in FIG. 57 be performed by circuitry integrated on the same substrate as the image sensors. (Some of the operations may be performed by programmable hardware, either on the substrate or off, responsive to software instructions.)

While the foregoing operations are described as immediately following conversion of the analog sensor signals to digital form, in other embodiments such operations can be performed after other processing operations (e.g., Bayer interpolation, white balance correction, JPEG compression, etc.).

Some of the services to which information is sent may be provided locally in the cell phone. Or they can be provided by a remote device, with which the cell phone establishes a link that is at least partly wireless. Or such processing can be distributed among various devices.

(While described in the context of conventional CCD and CMOS sensors, this technology is applicable regardless of sensor type. Thus, for example, Foveon and panchromatic image sensors can alternately be used. So can high dynamic range sensors, and sensors using Kodak's Truesense Color Filter Pattern (which adds panchromatic sensor pixels to the usual Bayer array of red/green/blue sensor pixels). Sensors with infrared output data can also advantageously be used. For example, sensors that output infrared image data (in addition to visible image data, or not) can be used to identify faces and other image subjects with temperature differentials, aiding in segmenting image subjects within the frame.)

It will be recognized that devices employing the FIG. 57 architecture have, essentially, two parallel processing chains. One processing chain produces data to be rendered into perceptual form for use by human viewers. This chain typically includes at least one of a de-mosaic processor, a white balance module, and a JPEG image compressor, etc. The second processing chain produces data to be analyzed by one or more machine-implemented algorithms, and in the illustrative example includes a Fourier transform processor, an eigenface processor, etc.

Such processing architectures are further detailed in application 61/176,739, cited earlier.

By arrangements such as the foregoing, one or more appropriate image-responsive services can begin formulating candidate responses to the visual stimuli before the user has even decided what to do with the captured image.

Further Comments on Visual Intelligence Pre-Processing

While static image pre-processing was discussed in connection with FIG. 57 (and FIG. 50), such processing can also include temporal aspects, such as motion.

Motion is most commonly associated with video, and the techniques detailed herein can be used when capturing video content. However, motion/temporal implications are also present with “still” imagery.

For example, some image sensors are read sequentially, top row to bottom row. During the reading operation, the image subject may move within the image frame (i.e., due to camera movement or subject movement). An exaggerated view of this effect is shown in FIG. 60, depicting an imaged “E” captured as the sensor is moved to the left. The vertical stroke of the letter is further from the left edge of the image frame at the bottom than the top, due to movement of the sensor while the pixel data is being clocked-out.

The phenomenon also arises when the camera assembles data from several frames to generate a single “still” image. Often unknown to the user, many consumer imaging devices rapidly capture plural frames of image data, and composite different aspects of the data together (using software provided, e.g., by FotoNation, Inc., now Tessera Technologies, Inc.). For example, the device may take three exposures: one exposed to optimize appearance of faces detected in the image frame, another exposed in accordance with the background, and another exposed in accordance with the foreground. These are melded together to create a pleasing montage. (In another example, the camera captures a burst of frames and, in each, determines whether persons are smiling or blinking. It may then select different faces from different frames to yield a final image.)

Thus, the distinction between video and still imagery is no longer simply a device modality, but rather is becoming a user modality.

Detection of motion can be accomplished in the spatial domain (e.g., by reference to movement of feature pixels between frames), or in a transform domain. Fourier transform and DCT data are exemplary. The system may extract the transform domain signature of an image component, and track its movement across different frames, identifying its motion. One illustrative technique deletes, e.g., the lowest N frequency coefficients, leaving just high frequency edges, etc. (The highest M frequency coefficients may be disregarded as well.) A thresholding operation is performed on the magnitudes of the remaining coefficients, zeroing those below a value (such as 30% of the mean). The resulting coefficients serve as the signature for that image region. (The transform may be based, e.g., on tiles of 8×8 pixels.) When a pattern corresponding to this signature is found at a nearby location within another (or the same) image frame (using known similarity testing, such as correlation), movement of that image region can be identified.
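The signature technique just described can be sketched as follows, using a DCT on 8×8 tiles. Several details are assumptions made for illustration and are not specified in the text: coefficients are ordered from low to high frequency by the sum of their row and column indices, N and M both default to 3, the search window is ±2 pixels, and the test frames are synthetic random data. All function names are hypothetical.

```python
import math
import random

N = 8  # tile size, matching the 8x8 example in the text

# Precomputed orthonormal DCT-II basis: BASIS[u][x] = c(u) * cos((2x+1) u pi / 2N)
BASIS = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
          for x in range(N)] for u in range(N)]

# Coefficient positions ordered from low to high frequency (assumed: by u + v).
ORDER = sorted(((u, v) for u in range(N) for v in range(N)),
               key=lambda p: (p[0] + p[1], p[0]))


def dct2(tile):
    """Naive 2D DCT-II of an N x N tile (list of rows)."""
    return [[sum(tile[x][y] * BASIS[u][x] * BASIS[v][y]
                 for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def signature(tile, drop_low=3, drop_high=3, fraction=0.30):
    """Drop the lowest/highest coefficients, then zero the weak remainder."""
    coeffs = dct2(tile)
    kept = ORDER[drop_low:len(ORDER) - drop_high]
    mean_mag = sum(abs(coeffs[u][v]) for u, v in kept) / len(kept)
    sig = [[0.0] * N for _ in range(N)]
    for u, v in kept:
        if abs(coeffs[u][v]) >= fraction * mean_mag:
            sig[u][v] = coeffs[u][v]
    return sig


def correlation(a, b):
    """Normalized correlation between two signatures."""
    fa = [c for row in a for c in row]
    fb = [c for row in b for c in row]
    norm = math.sqrt(sum(x * x for x in fa) * sum(y * y for y in fb))
    return sum(x * y for x, y in zip(fa, fb)) / norm if norm else 0.0


def tile_at(frame, row, col):
    """Extract the N x N tile whose top-left pixel is (row, col)."""
    return [r[col:col + N] for r in frame[row:row + N]]


def find_motion(frame_a, frame_b, row, col, search=2):
    """Return the (d_row, d_col) offset in frame_b best matching the tile."""
    ref = signature(tile_at(frame_a, row, col))
    best, best_score = (0, 0), -2.0
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            cand = signature(tile_at(frame_b, row + dr, col + dc))
            score = correlation(ref, cand)
            if score > best_score:
                best, best_score = (dr, dc), score
    return best


# Synthetic check: frame_b is frame_a shifted down 1 row and right 2 columns.
rng = random.Random(0)
frame_a = [[rng.randrange(256) for _ in range(24)] for _ in range(24)]
frame_b = [[frame_a[r - 1][c - 2] if r >= 1 and c >= 2 else 0
            for c in range(24)] for r in range(24)]
offset = find_motion(frame_a, frame_b, 8, 8)
```

A practical implementation would likely use a fast DCT and a wider search window; the naive transform is used here only to keep the sketch self-contained.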

Image Conveyance of Semantic Information

In many systems it is desirable to perform a set of processing steps (like those detailed above) that extract information about the incoming content (e.g., image data) in a scalable (e.g., distributed) manner. This extracted information (metadata) is then desirably packaged to facilitate subsequent processing (which may be application specific, or more computationally intense, and can be performed within the originating device or by a remote system).

A rough analogy is user interaction with Google. Bare search terms aren't sent to a Google mainframe, as if from a dumb terminal. Instead, the user's computer formats a query as an HTTP request, including the internet protocol address of the originating computer (indicative of location), and makes available cookie information by which user language preferences, desired safe search filtering, etc., can be discerned. This structuring of relevant information serves as a precursor to Google's search process, allowing Google to perform the search process more intelligently, providing faster and better results to the user.

FIG. 61 shows some of the metadata that may be involved in an exemplary system. The left-most column of information types may be computed directly from the native image data signals taken from the image sensor. (As noted, some or all of these can be computed using processing arrangements integrated with the sensor on a common substrate.) Additional information may be derived by reference to these basic data types, as shown by the second column of information types. This further information may be produced by processing in the cell phone, or external services can be employed (e.g., the OCR recognition service shown in FIG. 57 can be within the cell phone, or can be a remote server, etc.; similarly with the operations shown in FIG. 50).

How can this information be packaged to facilitate subsequent processing? One alternative is to convey it in the “alpha” channel of common image formats.

Most image formats represent imagery by data conveyed in plural channels, or byte-planes. In RGB, for example, one channel conveys red luminance, a second conveys green luminance, and a third conveys blue luminance. Similarly with CMYK (the channels respectively conveying cyan, magenta, yellow, and black information). Ditto with YUV, commonly used with video (a luma, or brightness, channel: Y, and two color channels: U and V), and LAB (also brightness, with two color channels).

These imaging constructs are commonly extended to include an additional channel: alpha. The alpha channel is provided to convey opacity information, indicating the extent to which background subjects are visible through the imagery.

While commonly supported by image processing file structures, software and systems, the alpha channel is not much used (except, most notably, in computer generated imagery and radiology). Certain implementations of the present technology use the alpha channel to transmit information derived from image data.

The different channels of image formats commonly have the same size and bit-depth. In RGB, for example, the red channel may convey 8-bit data (allowing values of 0-255 to be represented), for each pixel in a 640×480 array. Likewise with the green and blue channels. The alpha channel in such arrangements is also commonly 8 bits, and co-extensive with the image size (e.g., 8 bits×640×480). Every pixel thus has a red value, a green value, a blue value, and an alpha value. (The composite image representation is commonly known as RGBA.)

A few of the many ways the alpha channel can be used to convey information derived from the image data are shown in FIGS. 62-71, and discussed below.

FIG. 62 shows a picture that a user may snap with a cell phone. A processor in the cell phone (on the sensor substrate or elsewhere) may apply an edge detection filter (e.g., a Sobel filter) to the image data, yielding an edge map. Each pixel of the image is either determined to be part of an edge, or not. So this edge information can be conveyed in just one bit plane of the eight bit planes available in the alpha channel. Such an alpha channel payload is shown in FIG. 63.
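A minimal sketch of this arrangement follows. It assumes a Sobel gradient with a fixed magnitude threshold (the value 128 is arbitrary), leaves border pixels unmarked, and numbers bit planes from 0 (least significant), whereas the specification counts bit planes from #1. The function names are hypothetical.

```python
def sobel_edge_map(grey, threshold=128):
    """Return a 0/1 edge map for a greyscale image (list of rows).

    Border pixels are left at 0. The threshold is an assumed value.
    """
    h, w = len(grey), len(grey[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = (grey[r - 1][c + 1] + 2 * grey[r][c + 1] + grey[r + 1][c + 1]
                  - grey[r - 1][c - 1] - 2 * grey[r][c - 1] - grey[r + 1][c - 1])
            gy = (grey[r + 1][c - 1] + 2 * grey[r + 1][c] + grey[r + 1][c + 1]
                  - grey[r - 1][c - 1] - 2 * grey[r - 1][c] - grey[r - 1][c + 1])
            edges[r][c] = 1 if abs(gx) + abs(gy) >= threshold else 0
    return edges


def write_bit_plane(alpha, plane, bits):
    """Return a copy of the 8-bit alpha channel with one bit plane replaced."""
    mask = 1 << plane
    return [[(a & ~mask & 0xFF) | (mask if b else 0)
             for a, b in zip(arow, brow)]
            for arow, brow in zip(alpha, bits)]


def read_bit_plane(alpha, plane):
    """Extract one bit plane of the alpha channel as a 0/1 array."""
    return [[(a >> plane) & 1 for a in row] for row in alpha]


# A 6x6 test image with a vertical step edge between columns 2 and 3.
image = [[0 if c < 3 else 255 for c in range(6)] for _ in range(6)]
edges = sobel_edge_map(image)

# Alpha channel already carrying other data in its upper bit planes.
alpha = [[0b10100000] * 6 for _ in range(6)]
alpha_out = write_bit_plane(alpha, 0, edges)
```

Because only one bit of each alpha value is touched, the other seven bit planes remain available for other information.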

The cell phone camera may also apply known techniques to identify faces within the image frame. The red, green and blue image data from pixels corresponding to facial regions can be combined to yield a greyscale representation, and this representation can be included in the alpha channel, e.g., in aligned correspondence with the identified faces in the RGB image data. An alpha channel conveying both edge information and greyscale faces is shown in FIG. 64. (An 8-bit greyscale is used for faces in the illustrated embodiment, although a shallower bit-depth, such as 6- or 7-bits, can be used in other arrangements, freeing other bit planes for other information.)
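The shallower bit-depth option can be illustrated as below. The sketch assumes the common ITU-R BT.601 luma weights for combining red, green and blue (the specification does not say how the channels are combined), and assumes the greyscale value occupies the most significant bits of the alpha byte. Function names are hypothetical.

```python
def to_grey(r, g, b):
    """Combine 8-bit RGB into 8-bit grey using BT.601 weights (an assumption)."""
    return (299 * r + 587 * g + 114 * b) // 1000


def pack_face_pixel(alpha_value, grey_value, depth=6):
    """Store `depth` bits of grey in the top of the alpha byte.

    The remaining low-order bit planes keep whatever data they already held.
    """
    free = 8 - depth
    low_mask = (1 << free) - 1
    return ((grey_value >> free) << free) | (alpha_value & low_mask)
```

With `depth=6`, two bit planes stay free in facial regions, e.g., for the edge map or markers.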

The camera may also perform operations to locate the positions of the eyes and mouth in each detected face. Markers can be transmitted in the alpha channel, indicating the scale and positions of these detected features. A simple form of marker is a “smiley face” bit-mapped icon, with the eyes and mouth of the icon located at the positions of the detected eyes and mouth. The scale of the face can be indicated by the length of the iconic mouth, or by the size of a surrounding oval (or the space between the eye markers). The tilt of the face can be indicated by the angle of the mouth (or the angle of the line between the eyes, or the tilt of a surrounding oval).

If the cell phone processing yields a determination of the genders of persons depicted in the image, this too can be represented in the extra image channel. For example, an oval line circumscribing the detected face of a female may be made dashed or otherwise patterned. The eyes may be represented as cross-hairs or Xs instead of blackened circles, etc. Ages of depicted persons may also be approximated, and indicated similarly. The processing may also classify each person's emotional state by visual facial clues, and an indication such as surprise/happiness/sadness/anger/neutral can be represented. (See, e.g., Su, “A simple approach to facial expression recognition,” Proceedings of the 2007 Int'l Conf on Computer Engineering and Applications, Queensland, Australia, 2007, pp. 456-461. See also patent publications 20080218472 (Emotiv Systems, Pty), and 20040207720 (NTT DoCoMo).)

When a determination has some uncertainty (such as guessing gender, age range, or emotion), a confidence metric output by the analysis process can also be represented in an iconic fashion, such as by the width of the line, or the scale or selection of pattern elements.

FIG. 65 shows different pattern elements that can be used to denote different information, including gender and confidence, in an auxiliary image plane.

The portable device may also perform operations culminating in optical character recognition of alphanumeric symbols and strings depicted in the image data. In the illustrated example, the device may recognize the string “LAS VEGAS” in the picture. This determination can be memorialized by a PDF417 2D barcode added to the alpha channel. The barcode can be in the position of the OCR'd text in the image frame, or elsewhere.

(PDF417 is exemplary only. Other barcodes (such as 1D, Aztec, Datamatrix, High Capacity Color Barcode, Maxicode, QR Code, Semacode, and ShotCode), or other machine-readable data symbologies (such as OCR fonts and data glyphs), can naturally be used. Glyphs can be used both to convey arbitrary data, and also to form halftone image depictions. See in this regard Xerox's U.S. Pat. No. 6,419,162, and Hecht, “Printed Embedded Data Graphical User Interfaces,” IEEE Computer Magazine, Vol. 34, No. 3, 2001, pp. 47-55.)

FIG. 66 shows an alpha channel representation of some of the information determined by the device. All of this information is structured in a manner that allows it to be conveyed within just a single bit plane (of the eight bit planes) of the alpha channel. Information resulting from other of the processing operations (e.g., the analyses shown in FIGS. 50 and 61) may be conveyed in this same bit plane, or in others.

While FIGS. 62-66 showed a variety of information that can be conveyed in the alpha channel, and different representations of same, still more are shown in the example of FIGS. 67-69. These involve a cell phone picture of a new GMC truck and its owner.

Among other processing, the cell phone in this example processed the image data to recognize the model, year and color of the truck, recognize the text on the truck grill and the owner's t-shirt, recognize the owner's face, and recognize areas of grass and sky.

The sky was recognized by its position at the top of the frame, its color histogram within a threshold distance of expected norms, and a spectral composition weak in certain frequency coefficients (e.g., a substantially “flat” region). The grass was recognized by its texture and color. (Other techniques for recognizing these features are taught, e.g., in Batlle, “A review on strategies for recognizing natural objects in colour images of outdoor scenes,” Image and Vision Computing, Volume 18, Issues 6-7, 1 May 2000, pp. 515-530; Hayashi, “Fast Labelling of Natural Scenes Using Enhanced Knowledge,” Pattern Analysis & Applications, Volume 4, Number 1, March, 2001, pp. 20-27; and Boutell, “Improved semantic region labeling based on scene context,” IEEE Int'l Conf. on Multimedia and Expo, July, 2005. See also patent publications 20050105776 and 20050105775 (Kodak).) The trees could have been similarly recognized.

The human face in the image was detected using arrangements like those commonly employed in consumer cameras. Optical character recognition was performed on a data set resulting from application of an edge detection algorithm to the input image, followed by Fourier and Mellin transforms. (While finding the text GMC and LSU TIGERS, the algorithm failed to identify other text on the t-shirt, and text on the tires. With additional processing time, some of this missing text might have been decoded.)

The truck was first classed as a vehicle, and then as a truck, and then finally identified as a Dark Crimson Metallic 2007 GMC Sierra Z-71 with extended cab, by pattern matching. (This detailed identification was obtained through use of known reference truck images, from resources such as the GM trucks web site, Flickr, and a fan site devoted to identifying vehicles in Hollywood motion pictures: IMCDB-dot-com. Another approach to make and model recognition is detailed in Zafar, “Localized Contourlet Features in Vehicle Make and Model Recognition,” Proc. SPIE, Vol. 7251, 725105, 2009.)

FIG. 68 shows an illustrative graphical, bitonal representation of the discerned information, as added to the alpha channel of the FIG. 67 image. (FIG. 69 shows the different planes of the composite image: red, green, blue, and alpha.)

The portion of the image area detected as depicting grass is indicated by a uniform array of dots. The image area depicting sky is represented as a grid of lines. (If trees had been particularly identified, they could have been labeled using one of the same patterns, but with different size/spacing/etc. Or an entirely different pattern could have been used.)

The identification of the truck as a Dark Crimson Metallic 2007 GMC Sierra Z-71 with extended cab is encoded in a PDF417 2D barcode, scaled to the size of the truck and masked by its shape. Because PDF417 encodes information redundantly, with error-correction features, the portions of the rectangular barcode that are missing do not prevent the encoded information from being recovered.

The face information is encoded in a second PDF417 barcode. This second barcode is oriented at 90 degrees relative to the truck barcode, and is scaled differently, to help distinguish the two distinct symbols to downstream decoders. (Other different orientations could be used, and in some cases are preferable, e.g., 30 degrees, 45 degrees, etc.)

The facial barcode is oval in shape, and may be outlined with an oval border (although this is not depicted). The center of the barcode is placed at the mid-point of the person's eyes. The width of the barcode is twice the distance between the eyes. The height of the oval barcode is four times the distance between the mouth and a line joining the eyes.
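The geometry just given can be worked through as below. The sketch interprets "the distance between the mouth and a line joining the eyes" as the perpendicular distance from the mouth point to that line, which is an assumption; the function name and coordinate values are hypothetical.

```python
import math


def face_barcode_oval(left_eye, right_eye, mouth):
    """Compute center, width and height of the oval facial barcode.

    Points are (x, y) pixel coordinates of the detected features.
    """
    (x1, y1), (x2, y2) = left_eye, right_eye
    mx, my = mouth
    eye_dist = math.hypot(x2 - x1, y2 - y1)
    # Perpendicular distance from the mouth to the line through both eyes.
    mouth_dist = abs((x2 - x1) * (y1 - my) - (x1 - mx) * (y2 - y1)) / eye_dist
    return {
        "center": ((x1 + x2) / 2, (y1 + y2) / 2),  # mid-point of the eyes
        "width": 2 * eye_dist,                      # twice the eye spacing
        "height": 4 * mouth_dist,                   # four times mouth-to-eye-line
    }


oval = face_barcode_oval((100, 100), (140, 100), (120, 150))
```

For eyes 40 pixels apart and a mouth 50 pixels below the eye line, the oval is centered at (120, 100), 80 pixels wide and 200 pixels high.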

The payload of the facial barcode conveys information discerned from the face. In rudimentary embodiments, the barcode simply indicates the apparent presence of a face. In more sophisticated embodiments, eigenvectors computed from the facial image can be encoded. If a particular face is recognized, information identifying the person can be encoded. If the processor makes a judgment about the likely gender of the subject, this information can be conveyed in the barcode too.

Persons appearing in imagery captured by consumer cameras and cell phones are not random: a significant percentage are of recurring subjects, e.g., the owner's children, spouse, friends, the user himself/herself, etc. There are often multiple previous images of these recurring subjects distributed among devices owned or used by the owner, e.g., PDA, cell phone, home computer, network storage, etc. Many of these images are annotated with names of the persons depicted. From such reference images, sets of characterizing facial vectors can be computed, and used to identify subjects in new photos. (As noted, Google's Picasa service works on this principle to identify persons in a user's photo collection; Facebook and iPhoto do likewise.) Such a library of reference facial vectors can be checked to try to identify the person depicted in the FIG. 67 photograph, and the identification can be represented in the barcode. (The identification can comprise the person's name, and/or other identifier(s) by which the matched face is known, e.g., an index number in a database or contact list, a telephone number, a Facebook user name, etc.)

Text recognized from regions of the FIG. 67 image is added to corresponding regions of the alpha channel frame, presented in a reliably decodable OCR font. (OCR-A is depicted although other fonts may be used.)

A variety of further information may be included in the FIG. 68 alpha channel. For example, locations in the frame where a processor suspects text is present, but OCRing did not successfully decode alphanumeric symbols (on the tires perhaps, or other characters on the person's shirt), can be identified by adding a corresponding visual clue (e.g., a pattern of diagonal lines). An outline of the person (rather than just an indication of his face) can also be detected by a processor, and indicated by a corresponding border or fill pattern.

While the examples of FIGS. 62-66 and FIGS. 67-69 show various different ways of representing semantic metadata in the alpha channel, still more techniques are shown in the example of FIGS. 70-71. Here a user has captured a snapshot of a child at play (FIG. 70).

The child's face is turned away from the camera, and is captured with poor contrast. However, even with this limited information, the processor makes a likely identification by referring to the user's previous images: the user's firstborn child Matthew Doe (who seems to be found in countless of the user's archived photos).

As shown in FIG. 71, the alpha channel in this example conveys an edge-detected version of the user's image. Superimposed over the child's head is a substitute image of the child's face. This substitute image can be selected for its composition (e.g., depicting two eyes, nose and mouth) and better contrast.

In some embodiments, each person known to the system has an iconic facial image that serves as a visual proxy for the person in different contexts. For example, some PDAs store contact lists that include facial images of the contacts. The user (or the contacts) provides facial images that are easily recognized, i.e., iconic. These iconic facial images can be scaled to match the head of the person depicted in an image, and added to the alpha channel at the corresponding facial location.

Also included in the alpha channel depicted in FIG. 71 is a 2D barcode. This barcode can convey other of the information discerned from processing of the image data or otherwise available (e.g., the child's name, a color histogram, exposure metadata, how many faces were detected in the picture, the ten largest DCT or other transform coefficients, etc.).

To make the 2D barcode as robust as possible to compression and other image processing operations, its size may not be fixed, but rather is dynamically scaled based on circumstances, such as image characteristics. In the depicted embodiment, the processor analyzes the edge map to identify regions with uniform edginess (i.e., within a thresholded range). The largest such region is selected. The barcode is then scaled and placed to occupy a central area of this region. (In subsequent processing, the edginess where the barcode was substituted can be largely recovered by averaging the edginess at the center points adjoining the four barcode sides.)

In another embodiment, region size is tempered with edginess in determining where to place a barcode: low edginess is preferred. In this alternative embodiment, a smaller region of lower edginess may be chosen over a larger region of higher edginess. The size of each candidate region, minus a scaled value of edginess in the region, can serve as a metric to determine which region should host the barcode. This is the arrangement used in FIG. 71, resulting in placement of the barcode in a region to the left of Matthew's head, rather than in a larger, but edgier, region to the right.
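The placement metric (region size minus a scaled edginess value) can be sketched as follows. The region records, their units, and the `edginess_weight` scale factor are hypothetical; the specification does not give numeric values.

```python
def pick_barcode_region(regions, edginess_weight=1.0):
    """Choose the region maximizing size minus scaled edginess.

    Each region is a dict with a "size" (e.g., pixel count) and an
    "edginess" (e.g., count of edge pixels); both are assumed measures.
    """
    return max(regions,
               key=lambda r: r["size"] - edginess_weight * r["edginess"])


candidates = [
    {"name": "left of head", "size": 4000, "edginess": 500},
    {"name": "right of head", "size": 6000, "edginess": 3500},
]
chosen = pick_barcode_region(candidates)
```

Here the smaller left region scores 3500 against 2500 for the larger, edgier right region, so it hosts the barcode. Setting `edginess_weight` to zero reduces the rule to the earlier embodiment, which simply selects the largest region.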

Although the FIG. 70 photo is relatively “edgy” (as contrasted, e.g., with the FIG. 62 photo), much of the edginess may be irrelevant. In some embodiments the edge data is filtered to preserve only the principal edges (e.g., those indicated by continuous line contours). Within otherwise vacant regions of the resulting filtered edge map a processor can convey additional data. In one arrangement the processor inserts a pattern to indicate a particular color histogram bin into which that region's image colors fall. (In a 64-bin histogram, requiring 64 different patterns, bin 2 may encompass colors in which the red channel has values of 0-63, the green channel has values of 0-63, and the blue channel has values of 64-127, etc.) Other image metrics can similarly be conveyed.
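One bin numbering consistent with the bin 2 example is sketched below: each 8-bit channel is quantized to four levels of 64 values, bins are numbered from 1, and blue varies fastest. Only bin 2 is given in the text, so the ordering of the other bins is an assumption, and the function name is hypothetical.

```python
def histogram_bin(r, g, b):
    """Map an 8-bit RGB color to one of 64 bins, numbered 1-64.

    Each channel is quantized to 4 levels (0-63, 64-127, 128-191, 192-255).
    """
    return 16 * (r // 64) + 4 * (g // 64) + (b // 64) + 1
```

Under this numbering, a color such as (10, 20, 100) falls in bin 2, matching the example, and each of the 64 bin numbers would select one of the 64 fill patterns.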

Instead of using different patterns to indicate different data, vacant regions in a filtered edge map can be filled with a noise-like signal, steganographically encoded to convey histogram (or other) information as digital watermark data. (A suitable watermarking technology is detailed in Digimarc's U.S. Pat. No. 6,590,996.)

It will be recognized that some of the information in the alpha channel, if visually presented to a human in a graphical form, conveys useful information. From FIG. 63 a human can distinguish a man embracing a woman, in front of a sign stating “WELCOME TO Fabulous LAS VEGAS NEVADA.” From FIG. 64 the human can see greyscale faces, and an outline of the scene. From FIG. 66 the person can additionally identify a barcode conveying some information, and can identify two smiley face icons showing the positions of faces.

Likewise, a viewer to whom the frame of graphical information in FIG. 68 is rendered can identify an outline of a person, can read the LSU TIGERS from the person's shirt, and make out what appears to be the outline of a truck (aided by the clue of the GMC text where the truck's grill would be).

From presentation of the FIG. 71 alpha channel data a human can identify a child sitting on the floor, playing with toys.

The barcode in FIG. 71, like the barcode in FIG. 66, conspicuously indicates to an inspecting human the presence of information, albeit not its content.

Other of the graphical content in the alpha channel may not be informative to a human upon inspection. For example, if the child's name is steganographically encoded as a digital watermark in a noise-like signal in FIG. 71, even the presence of information in that noise may go undetected by the person.

The foregoing examples detail some of the diversity of semantic information that can be stuffed into the alpha channel, and the diversity of representation constructs that can be employed. Of course, this is just a small sampling; the artisan can quickly adapt these teachings to the needs of particular applications, yielding many other, different embodiments. Thus, for example, any of the information that can be extracted from an image can be memorialized in the alpha channel using arrangements akin to those disclosed herein.

It will be recognized that information relating to the image can be added to the alpha channel at different times, by different processors, at different locations. For example, the sensor chip in a portable device may have on-chip processing that performs certain analyses, and adds resulting data to the alpha channel. The device may have another processor that performs further processing (on the image data and/or on the results of the earlier analyses) and adds a representation of those further results to the alpha channel. (These further results may be based, in part, on data acquired wirelessly from a remote source. For example, a consumer camera may link by Bluetooth to the user's PDA, to obtain facial information from the user's contact files.)

The composite image file may be transmitted from the portable device to an intermediate network node (e.g., at a carrier such as Verizon, AT&T, or T-Mobile, or at another service provider), which performs additional processing, and adds its results to the alpha channel. (With its more capable processing hardware, such an intermediate network node can perform more complex, resource-intensive processing, such as more sophisticated facial recognition and pattern matching. With its higher-bandwidth network access, such a node can also employ a variety of remote resources to augment the alpha channel with additional data, e.g., links to Wikipedia entries or Wikipedia content itself, information from telephone database and image database lookups, etc.) The thus-supplemented image may then be forwarded to an image query service provider (e.g., SnapNow, MobileAcuity, etc.), which can continue the process and/or instruct a responsive action based on the information thus-provided.

The alpha channel may thus convey an iconic view of what all preceding processing has discerned or learned about the image. Each subsequent processor can readily access this information, and contribute still more. All this within the existing workflow channels and constraints of long-established file formats.

In some embodiments, the provenance of some or all of the discerned/inferred data is indicated. For example, stored data may indicate that OCRing which yielded certain text was performed by a Verizon server having a unique identifier, such as a MAC address of 01-50-F3-83-AB-CC or network identifier PDX-LA002290.corp.verizon-dot-com, on Aug. 28, 2008, 8:35 pm. Such information can be stored in the alpha channel, in header data, in a remote repository to which a pointer is provided, etc.

Different processors may contribute to different bit-planes of the alpha channel. A capture device may write its information to bit plane #1. An intermediate node may store its contributions in bit plane #2. Etc. Certain bit planes may be available for shared use.

Or different bit planes may be allocated for different classes or types of semantic information. Information relating to faces or persons in the image may always be written to bit plane #1. Information relating to places may always be written to bit plane #2. Edge map data may always be found in bit plane #3, together with color histogram data (e.g., represented in 2D barcode form). Other content labeling (e.g., grass, sand, sky) may be found in bit plane #4, together with OCR'd text. Textual information, such as related links or textual content obtained from the web, may be found in bit plane #5. (ASCII symbols may be included as bit patterns, e.g., with each symbol taking 8 bits in the plane. Robustness to subsequent processing can be enhanced by allocating 2 or more bits in the image plane for each bit of ASCII data. Convolutional coding and other error correcting technologies can also be employed for some or all of the image plane information. So, too, can error correcting barcodes.)
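The ASCII bit-pattern idea, with several plane bits allocated to each data bit, can be sketched with a simple repetition code and majority-vote decoding. The choice of three repetitions, the most-significant-bit-first ordering, and the majority rule are assumptions made for illustration; with an even number of repetitions a tie would need some other rule. Function names are hypothetical.

```python
def encode_text(text, repeat=3):
    """Expand ASCII text into a flat list of bits, repeating each data bit."""
    bits = []
    for byte in text.encode("ascii"):
        for i in range(7, -1, -1):          # most significant bit first
            bits.extend([(byte >> i) & 1] * repeat)
    return bits


def decode_text(bits, repeat=3):
    """Recover ASCII text by majority vote over each group of repeated bits."""
    chars = []
    group = 8 * repeat
    for start in range(0, len(bits) - group + 1, group):
        byte = 0
        for i in range(8):
            chunk = bits[start + i * repeat:start + (i + 1) * repeat]
            byte = (byte << 1) | (1 if 2 * sum(chunk) > len(chunk) else 0)
        chars.append(chr(byte))
    return "".join(chars)


plane_bits = encode_text("LAS VEGAS")
# Simulate damage from later processing: flip two isolated bits.
damaged = list(plane_bits)
damaged[0] ^= 1
damaged[50] ^= 1
recovered = decode_text(damaged)
```

The resulting bit list would be written row by row into the chosen bit plane; convolutional coding, as the text notes, is a stronger alternative to plain repetition.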

An index to the information conveyed in the alpha channel can be compiled, e.g., in an EXIF header associated with the image, allowing subsequent systems to speed their interpretation and processing of such data. The index can employ XML-like tags, specifying the types of data conveyed in the alpha channel, and optionally other information (e.g., their locations).

Locations can be specified as the location of the upper-most bit (or upper-left-most bit) in the bit-plane array, e.g., by X-, Y-coordinates. Or a rectangular bounding box can be specified by reference to two corner points (e.g., specified by X,Y coordinates), detailing the region where information is represented.

In the example of FIG. 66, the index may convey information such as

<MaleFace1> AlphaBitPlane1 (637,938) </MaleFace1>
<FemaleFace1> AlphaBitPlane1 (750,1012) </FemaleFace1>
<OCRTextPDF417> AlphaBitPlane1 (75,450)-(1425,980) </OCRTextPDF417>
<EdgeMap> AlphaBitPlane1 </EdgeMap>

This index thus indicates that a male face is found in bit plane #1 of the alpha channel, with a top pixel at location (637,938); a female face is similarly present with a top pixel located at (750,1012); OCR'd text encoded as a PDF417 barcode is found in bit plane #1 in the rectangular area with corner points (75,450) and (1425,980), and that bit plane #1 also includes an edge map of the image.
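A downstream system might read such an index along the following lines. The regular expressions and the output record layout are assumptions; the specification describes the tags only as "XML-like" and does not define a formal grammar.

```python
import re

# The FIG. 66 index from the text, one entry per line.
INDEX = """<MaleFace1> AlphaBitPlane1 (637,938) </MaleFace1>
<FemaleFace1> AlphaBitPlane1 (750,1012) </FemaleFace1>
<OCRTextPDF417> AlphaBitPlane1 (75,450)-(1425,980) </OCRTextPDF417>
<EdgeMap> AlphaBitPlane1 </EdgeMap>"""

# <Tag> AlphaBitPlaneN optional-locations </Tag>
ENTRY = re.compile(r"<(\w+)>\s*AlphaBitPlane(\d+)\s*(.*?)\s*</\1>", re.S)
POINT = re.compile(r"\((\d+),(\d+)\)")


def parse_index(text):
    """Return one record per index entry: data type, bit plane, and points.

    One point denotes a top pixel; two points denote bounding-box corners;
    no points means the data spans the whole plane.
    """
    entries = []
    for tag, plane, rest in ENTRY.findall(text):
        points = [(int(x), int(y)) for x, y in POINT.findall(rest)]
        entries.append({"type": tag, "bit_plane": int(plane), "points": points})
    return entries


entries = parse_index(INDEX)
```

A consumer interested only in faces could then filter the records by type and go directly to the listed coordinates in the named bit plane.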

More or less information can naturally be provided. A different form of index, with less information, may specify, e.g.:

<AlphaBitPlane1> Face,Face,PDF417,EdgeMap </AlphaBitPlane1>

This form of index simply indicates that bit plane #1 of the alpha channel includes 2 faces, a PDF417 barcode, and an edge map.

An index with more information may specify data including the rotation angle and scale factor for each face, the LAS VEGAS payload of the PDF417 barcode, the angle of the PDF417 barcode, the confidence factors for subjective determinations, names of recognized persons, a lexicon or glossary detailing the semantic significance of each pattern used in the alpha channels (e.g., the patterns of FIG. 65, and the graphical labels used for sky and grass in FIG. 68), the sources of auxiliary data (e.g., of the superimposed child's face in FIG. 71, or the remote reference image data that served as basis for the conclusion that the truck in FIG. 67 is a Sierra Z-71), etc.

As can be seen, the index can convey information that is also conveyed in the bit planes of the alpha channel. Generally, different forms of representation are used in the alpha channel's graphical representations, versus the index. For example, in the alpha channel the femaleness of the second face is represented by the ‘+’s to represent the eyes; in the index the femaleness is represented by the XML tag <FemaleFace1>. Redundant representation of information can serve as a check on data integrity.

Sometimes header information, such as EXIF data, becomes separated from the image data (e.g., when the image is converted to a different format). Instead of conveying index information in a header, a bit plane of the alpha channel can serve to convey the index information, e.g., bit plane #1. One such arrangement encodes the index information as a 2D barcode. The barcode may be scaled to fill the frame, to provide maximum robustness to possible image degradation.

In some embodiments, some or all of the index information is replicated in different data stores. For example, it may be conveyed both in EXIF header form, and as a barcode in bit plane #1. Some or all of the data may also be maintained remotely, such as by Google, or other web storage “in the cloud.” Address information conveyed by the image can serve as a pointer to this remote storage. The pointer (which can be a URL, but more commonly is a UID or index into a database which, when queried, returns the current address of the sought-for data) can be included within the index, and/or in one or more of the bit planes of the alpha channel. Or the pointer can be steganographically encoded within the pixels of the image data (in some or all of the composite image planes) using digital watermarking technology.

In still other embodiments, some or all the information described above as stored in the alpha channel can additionally, or alternatively, be stored remotely, or encoded within the image pixels as a digital watermark. (The picture itself, with or without the alpha channel, can also be replicated in remote storage, by any device in the processing chain.)

Some image formats include more than the four planes detailed above. Geospatial imagery and other mapping technologies commonly represent data with formats that extend to a half-dozen or more information planes. For example, multispectral space-based imagery may have separate image planes devoted to (1) red, (2) green, (3) blue, (4) near infrared, (5) mid-infrared, (6) far infrared, and (7) thermal infrared. The techniques detailed above can convey derived/inferred image information using one or more of the auxiliary data planes available in such formats.

As an image moves between processing nodes, some of the nodes may overwrite data inserted by earlier processing. Although not essential, the overwriting processor may copy the overwritten information into remote storage, and include a link or other reference to it in the alpha channel, or index, or image, in case it is needed later.

When representing information in the alpha channel, consideration may be given to degradations to which this channel may be subjected. JPEG compression, for example, commonly discards high frequency details that do not meaningfully contribute to a human's perception of an image. Such discarding of information based on the human visual system, however, can work to disadvantage when applied to information that is present for other purposes (although human viewing of the alpha channel is certainly possible and, in some cases, useful).

To combat such degradation, the information in the alpha channel can be represented by features that would not likely be regarded as visually irrelevant. Different types of information may be represented by different features, so that the most important persist through even severe compression. Thus, for example, the presence of faces in FIG. 66 is signified by bold ovals. The locations of the eyes may be less relevant, so are represented by smaller features. Patterns shown in FIG. 65 may not be reliably distinguished after compression, and so might be reserved to represent secondary information, where loss is less important. With JPEG compression, the most-significant bit-plane is best preserved, whereas lesser-significant bit-planes are increasingly corrupted. Thus, the most important metadata should be conveyed in the most-significant bit planes of the alpha channel, to enhance survivability.
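The assignment of more important metadata to more significant bit planes can be sketched as follows. This is a minimal Python illustration under stated assumptions: the alpha channel is modeled as a list of rows of 8-bit values, the helper names `set_bit_plane` and `get_bit_plane` are hypothetical, and discarding the low four bits is only a crude stand-in for lossy compression (it is not JPEG).

```python
def set_bit_plane(alpha, mask, plane):
    """Return a copy of alpha whose bit `plane` (7 = most significant) is taken from mask."""
    bit = 1 << plane
    return [
        [(a & ~bit & 0xFF) | (bit if m else 0) for a, m in zip(arow, mrow)]
        for arow, mrow in zip(alpha, mask)
    ]

def get_bit_plane(alpha, plane):
    """Extract one bit plane as rows of 0/1 values."""
    return [[(a >> plane) & 1 for a in row] for row in alpha]

alpha = [[0] * 4 for _ in range(2)]
face_mask = [[0, 1, 1, 0], [0, 1, 1, 0]]   # more important: most-significant plane
eye_mask = [[0, 1, 0, 0], [0, 0, 0, 0]]    # less important: least-significant plane

alpha = set_bit_plane(alpha, face_mask, 7)
alpha = set_bit_plane(alpha, eye_mask, 0)

# Crude degradation model (an assumption): the low four bits are lost.
degraded = [[a & 0xF0 for a in row] for row in alpha]
```

Under this toy degradation the face mask in plane 7 is recovered intact, while the eye mask in plane 0 is lost.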

If technology of the sort illustrated by FIGS. 62-71 becomes a lingua franca for conveying metadata, image compression might evolve to take its presence into account. For example, JPEG compression may be applied to the red, green and blue image channels, but lossless (or less lossy) compression may be applied to the alpha channel. Since the various bit planes of the alpha channel may convey different information, they may be compressed separately, rather than as bytes of 8-bit depth. (If compressed separately, lossy compression may be more acceptable.) With each bit-plane conveying only bitonal information, compression schemes known from facsimile technology can be used, including Modified Huffman, Modified READ, run length encoding, and ITU-T T.6. Hybrid compression techniques are thus well-suited for such files.
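Of the facsimile-derived schemes just named, plain run length encoding is the simplest to illustrate. The following Python sketch encodes one bitonal row of a bit plane as a list of run lengths. It is not Modified Huffman, Modified READ or T.6, which add code tables and two-dimensional coding; the convention of always starting with a run of 0s (so a row that begins with 1 starts with a zero-length run) is an assumption adopted here for simplicity.

```python
def rle_encode(bits):
    """Encode a row of 0/1 values as alternating run lengths, beginning with a run of 0s."""
    runs = []
    current, count = 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)       # close the run just ended
            current, count = b, 1
    runs.append(count)
    return runs

def rle_decode(runs):
    """Invert rle_encode."""
    bits, value = [], 0
    for n in runs:
        bits.extend([value] * n)
        value ^= 1                   # runs alternate between 0 and 1
    return bits

row = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
encoded = rle_encode(row)
```

Because each bit plane is coded on its own, a plane carrying bold, sparse shapes yields few, long runs.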

Alpha channel conveyance of metadata can be arranged to progressively transmit and decode in general correspondence with associated imagery features, when using compression arrangements such as JPEG 2000. That is, since the alpha channel is presenting semantic information in the visual domain (e.g., iconically), it can be represented so that layers of semantic detail decompress at the same rate as the image.

In JPEG 2000, a wavelet transform is used to generate data representing the image. JPEG 2000 packages and processes this transform data in a manner yielding progressive transmission and decoding. For example, when rendering a JPEG 2000 image, the gross details of the image appear first, with successively finer details following. Similarly with transmission.

Consider the truck & man image of FIG. 67. Rendering a JPEG 2000 version of this image would first present the low frequency, bold form of the truck. Thereafter the shape of the man would appear. Next, features such as the GMC lettering on the truck grill, and the logo on the man's t-shirt would be distinguished. Finally, the detail of the man's facial features, the grass, the trees, and other high frequency minutiae would complete the rendering of the image. Similarly with transmission.

This progression is shown in the pyramid of FIG. 77A. Initially a relatively small amount of information is presented, giving gross shape details. Progressively the image fills in, ultimately ending with a relatively large amount of small detail data.

The information in the alpha channel can be arranged similarly (FIG. 77B). Information about the truck can be represented with a large, low frequency (shape-dominated) symbology. Information indicating the presence and location of the man can be encoded with a next-most-dominant representation. Information corresponding to the GMC lettering on the truck grill, and lettering on the man's shirt, can be represented in the alpha channel with a finer degree of detail. The finest level of salient detail in the image, e.g., the minutiae of the man's face, can be represented with the finest degree of detail in the alpha channel. (As may be noted, the illustrative alpha channel of FIG. 68 doesn't quite follow this model.)

If the alpha channel conveys its information in the form of machine-readable symbologies (e.g., barcodes, digital watermarks, glyphs, etc.), the order of alpha channel decoding can be deterministically controlled. Symbologies with the largest features are decoded first; those with the finest features are decoded last. Thus, the alpha channel can convey barcodes at several different sizes (all in the same bit frame, e.g., located side-by-side, or distributed among bit frames). Or the alpha channel can convey plural digital watermark signals, e.g., one at a gross resolution (e.g., corresponding to 10 watermark elements, or “waxels,” to the inch), and others at successively finer resolutions (e.g., 50, 100, 150 and 300 waxels per inch). Likewise with data glyphs: a range of larger and smaller sizes of glyphs can be used, and they will decode relatively earlier or later.

(JPEG 2000 is the most common of the compression schemes exhibiting progressive behavior, but there are others. JPEG, with some effort, can behave similarly. The present concepts are applicable whenever such progressivity exists.)

By such arrangements, as image features are decoded for presentation, or transmitted (e.g., by streaming media delivery), the corresponding metadata becomes available.

It will be recognized that results contributed to the alpha channel by the various distributed processing nodes are immediately available to each subsequent recipient of the image. A service provider receiving a processed image, for example, thus quickly understands that FIG. 62 depicts a man and a woman in Las Vegas; that FIG. 63 shows a man and his GMC truck; and that the FIG. 70 image shows a child named Matthew Doe. Edge map, color histogram, and other information conveyed with these images gives the service provider a headstart in its processing of the imagery, e.g., to segment it, recognize its content, initiate an appropriate response, etc.

Receiving nodes can also use the conveyed data to enhance stored profile information relating to the user. A node receiving the FIG. 66 metadata can note Las Vegas as a location of potential interest. A system receiving the FIG. 68 metadata can infer that GMC Z71 trucks are relevant to the user, and/or to the person depicted in that photo. Such associations can serve as launch points for tailored user experiences.

The metadata also allows images with certain attributes to be identified quickly, in response to user queries. (E.g., find pictures showing GMC Sierra Z71 trucks.) Desirably, web-indexing crawlers can check the alpha channels of images they find on the web, and add information from the alpha channel to the compiled index to make the image more readily identifiable to searchers.

As noted, an alpha channel-based approach is not essential for use of the technologies detailed in this specification. Another alternative is a data structure indexed by coordinates of image pixels. The data structure can be conveyed with the image file (e.g., as EXIF header data), or stored at a remote server.

For example, one entry in the data structure corresponding to pixel (637,938) in FIG. 66 may indicate that the pixel forms part of a male's face. A second entry for this pixel may point to a shared sub-data structure at which eigenface values for this face are stored. (The shared sub-data structure may also list all the pixels associated with that face.) A data record corresponding to pixel (622,970) may indicate the pixel corresponds to the left-side eye of the male's face. A data record indexed by pixel (155,780) may indicate that the pixel forms part of text recognized (by OCRing) as the letter “L”, and also falls within color histogram bin 49, etc. The provenance of each datum of information may also be recorded.

(Instead of identifying each pixel by X- and Y-coordinates, each pixel may be assigned a sequential number by which it is referenced.)

Instead of several pointers pointing to a common sub-data structure from data records of different pixels, the entries may form a linked list, in which each pixel includes a pointer to a next pixel with a common attribute (e.g., associated with the same face). A record for a pixel may include pointers to plural different sub-data structures, or to plural other pixels, to associate the pixel with plural different image features or data.
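Both variants of the pixel-indexed data structure can be sketched in a few lines of Python. The names used below (`records`, `features`, `tag_pixel`, `link_chain`, the feature identifier "face1", and the provenance strings) are hypothetical and chosen only for illustration; the eigenface values are left as a placeholder rather than invented.

```python
from collections import defaultdict

# Shared sub-data structures, keyed by a hypothetical feature identifier.
features = {"face1": {"kind": "MaleFace", "eigenface": None, "pixels": []}}

# Records indexed by pixel coordinates; a pixel may carry several entries.
records = defaultdict(list)

def tag_pixel(xy, note, feature_id=None, source=None):
    """Add one entry for pixel xy, optionally pointing at a shared sub-structure."""
    records[xy].append({"note": note, "feature": feature_id, "source": source})
    if feature_id is not None:
        features[feature_id]["pixels"].append(xy)

def link_chain(pixels):
    """Linked-list alternative: map each pixel to the next pixel sharing the attribute."""
    return {a: b for a, b in zip(pixels, pixels[1:] + [None])}

tag_pixel((637, 938), "part of male face", "face1", source="face-detector")
tag_pixel((622, 970), "left-side eye of male face", "face1", source="face-detector")
tag_pixel((155, 780), "OCR letter L", source="ocr")
tag_pixel((155, 780), "color histogram bin 49", source="histogram")

chain = link_chain(features["face1"]["pixels"])
```

The `source` field corresponds to the provenance that may be recorded for each datum.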

If the data structure is stored remotely, a pointer to the remote store can be included with the image file, e.g., steganographically encoded in the image data, expressed with EXIF data, etc. If any watermarking arrangement is used, the origin of the watermark (see Digimarc's U.S. Pat. No. 6,307,949) can be used as a base from which pixel references are specified as offsets (instead of using, e.g., the upper left corner of the image). Such an arrangement allows pixels to be correctly identified despite corruptions such as cropping, or rotation.

As with alpha channel data, the metadata written to a remote store is desirably available for search. A web crawler encountering the image can use the pointer in the EXIF data or the steganographically encoded watermark to identify a corresponding repository of metadata, and add metadata from that repository to its index terms for the image (despite being found at different locations).

By the foregoing arrangements it will be appreciated that existing imagery standards, workflows, and ecosystems, originally designed to support just pixel image data, are here employed in support of metadata as well.

(Of course, the alpha channel and other approaches detailed in this section are not essential to other aspects of the present technology. For example, information derived or inferred from processes such as those shown in FIGS. 50, 57 and 61 can be sent by other transmission arrangements, e.g., dispatched as packetized data using WiFi or WiMax, transmitted from the device using Bluetooth, sent as SMS short text or MMS multimedia messages, shared to another node in a low power peer-to-peer wireless network, conveyed with wireless cellular transmission or wireless data service, etc.)

Texting, Etc.

U.S. Pat. No. 5,602,566 (Hitachi), U.S. Pat. No. 6,115,028 (Silicon Graphics), U.S. Pat. No. 6,201,554 (Ericsson), U.S. Pat. No. 6,466,198 (Innoventions), U.S. Pat. No. 6,573,883 (Hewlett-Packard), U.S. Pat. No. 6,624,824 (Sun) and U.S. Pat. No. 6,956,564 (British Telecom), and published PCT application WO9814863 (Philips), teach that portable computers can be equipped with devices by which tilting can be sensed, and used for different purposes (e.g., scrolling through menus).

In accordance with another aspect of the present technology, a tip/tilt interface is used in connection with a typing operation, such as for composing text messages sent by a Short Message Service (SMS) protocol from a PDA, a cell phone, or other portable wireless device.

In one embodiment, a user activates a tip/tilt text entry mode using any of various known means (e.g., pushing a button, entering a gesture, etc.). A scrollable user interface appears on the device screen, presenting a series of icons. Each icon has the appearance of a cell phone key, such as a button depicting the numeral “2” and the letters “abc.” The user tilts the device left or right to scroll backwards or forwards through the series of icons, to reach a desired button. The user then tips the device towards or away from themselves to navigate between the three letters associated with that icon (e.g., tipping away navigates to “a;” no tipping corresponds to “b;” and tipping towards navigates to “c”). After navigating to the desired letter, the user takes an action to select that letter. This action may be pressing a button on the device (e.g., with the user's thumb), or another action may signal the selection. The user then proceeds as described to select subsequent letters. By this arrangement, the user enters a series of text without the constraints of big fingers on tiny buttons or UI features.
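The navigation logic of such an embodiment can be sketched as a small state holder. In the following Python illustration the class name `TipTiltEntry`, the wrap-around scrolling, and the restriction to three-letter keys are assumptions; sensing of the tilt and tip themselves is taken as given and reduced to simple arguments.

```python
# Three-letter keys only; keys bearing four letters are omitted from this sketch.
KEYS = ["abc", "def", "ghi", "jkl", "mno", "tuv"]

# Tipping away selects the first letter, no tip the second, tipping towards the third.
TIP_INDEX = {"away": 0, "none": 1, "towards": 2}

class TipTiltEntry:
    def __init__(self):
        self.key = 0      # index of the icon currently in focus
        self.text = ""    # characters selected so far

    def tilt(self, steps):
        """Scroll through the icons: negative steps = left, positive = right (wraps)."""
        self.key = (self.key + steps) % len(KEYS)

    def current_letter(self, tip="none"):
        """Letter that would be chosen for the given tip state."""
        return KEYS[self.key][TIP_INDEX[tip]]

    def select(self, tip="none"):
        """Commit the letter for the given tip state (e.g., on a button press)."""
        letter = self.current_letter(tip)
        self.text += letter
        return letter

entry = TipTiltEntry()
entry.tilt(+1)            # scroll from "abc" to "def"
entry.select("away")      # "d"
entry.tilt(-1)            # back to "abc"
entry.select("none")      # "b"
entry.select("towards")   # "c"
```

`current_letter` can drive the enlarged on-screen preview of the letter being navigated to, while `select` corresponds to the button press or other selecting action.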

Many variations are, of course, possible. The device needn't be a phone; it may be a wristwatch, keyfob, or have another small form factor.

The device may have a touch-screen. After navigating to a desired character, the user may tap the touch screen to effect the selection. When tipping/tilting the device, the corresponding letter can be displayed on the screen in an enlarged fashion (e.g., on the icon representing the button, or overlaid elsewhere) to indicate the user's progress in navigation.

While accelerometers or other physical sensors are employed in certain embodiments, others use a 2D optical sensor (e.g., a camera). The user can point the sensor to the floor, to a knee, or to another subject, and the device can then sense relative physical motion by sensing movement of features within the image frame (up/down; left/right). In such embodiments the image frame captured by the camera need not be presented on the screen; the symbol selection UI, alone, may be displayed. (Or, the UI can be presented as an overlay on the background image captured by the camera.)

In camera-based embodiments, as with embodiments employing physical sensors, another dimension of motion may also be sensed: up/down. This can provide an additional degree of control (e.g., shifting to capital letters, or shifting from characters to numbers, or selecting the current symbol, etc.).

In some embodiments, the device has several modes: one for entering text; another for entering numbers; another for symbols; etc. The user can switch between these modes by using mechanical controls (e.g., buttons), or through controls of a user interface (e.g., touches or gestures or voice commands). For example, while tapping a first region of the screen may select the currently-displayed symbol, tapping a second region of the screen may toggle the mode between character-entry and numeric-entry. Or one tap in this second region can switch to character-entry (the default); two taps in this region can switch to numeric-entry; and three taps in this region can switch to entry of other symbols.

Instead of selecting between individual symbols, such an interface can also include common words or phrases (e.g., signature blocks) to which the user can tip/tilt navigate, and then select. There may be several lists of words/phrases. For example, a first list may be standardized (pre-programmed by the device vendor), and include statistically common words. A second list may comprise words and/or phrases that are associated with a particular user (or a particular class of users). The user may enter these words into such a list, or the device can compile the list during operation, determining which words are most commonly entered by the user. (The second list may exclude words found on the first list, or not.) Again, the user can switch between these lists as described above.

Desirably, the sensitivity of the tip/tilt interface is adjustable by the user, to accommodate different user preferences and skills.

While the foregoing embodiments contemplated a limited grammar of tilts/tips, more expansive grammars can be devised. For example, while relatively slow tilting of the screen to the left may cause the icons to scroll in a given direction (left, or right, depending on the implementation), a sudden tilt of the screen in that direction can effect a different operation, such as inserting a line (or paragraph) break in the text. A sharp tilt in the other direction can cause the device to send the message.

Instead of the speed of tilt, the degree of tilt can correspond to different actions. For example, tilting the device between 5 and 25 degrees can cause the icons to scroll, but tilting the device beyond 30 degrees can insert a line break (if to the left) or can cause the message to be sent (if to the right).
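The degree-of-tilt grammar just described reduces to a simple threshold test, sketched below in Python. The function name `tilt_action`, the sign convention (negative angles for left tilts), the action labels, and the treatment of angles under 5 degrees or between 25 and 30 degrees as a dead band are assumptions made for illustration.

```python
def tilt_action(angle_deg):
    """Map a signed tilt angle (negative = left, positive = right) to an action label."""
    magnitude = abs(angle_deg)
    if magnitude > 30:
        # Beyond 30 degrees: line break if to the left, send if to the right.
        return "line_break" if angle_deg < 0 else "send"
    if 5 <= magnitude <= 25:
        # Between 5 and 25 degrees: scroll the icons.
        return "scroll_left" if angle_deg < 0 else "scroll_right"
    # Below 5 degrees, or between 25 and 30: no action (assumed dead band).
    return "none"
```

The dead band between the scrolling range and the 30 degree threshold helps keep a slightly over-enthusiastic scroll from being taken as a command to send the message.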

Different tip gestures can likewise trigger different actions.

The arrangements just described are necessarily only a few of the many different possibilities. Artisans adopting such technology are expected to modify and adapt these teachings as suited for particular applications.

Affine Capture Parameters

In accordance with another aspect of the present technology, a portable device captures, and may present, geometric information relating to the device's position (or that of a subject).

Digimarc's published patent application 20080300011 teaches various arrangements by which a cell phone can be made responsive to what it “sees,” including overlaying graphical features atop certain imaged objects. The overlay can be warped in accordance with the object's perceived affine distortion.

Steganographic calibration signals by which affine distortion of an imaged object can be accurately quantified are detailed, e.g., in Digimarc's U.S. Pat. Nos. 6,614,914 and 6,580,809; and in patent publications 20040105569, 20040101157, and 20060031684. Digimarc's U.S. Pat. No. 6,959,098 teaches how distortion can be characterized by such watermark calibration signals in conjunction with visible image features (e.g., edges of a rectilinear object). From such affine distortion information, the 6D location of a watermarked object relative to the imager of a cell phone can be determined.

There are various ways 6D location can be described. One is by three location parameters: x, y, z, and three angle parameters: tip, tilt, rotation. Another is by rotation and scale parameters, together with a 2D matrix of 4 elements that defines a linear transformation (e.g., shear mapping, translation, etc.). The matrix transforms the location of any pixel x,y to a resultant location after a linear transform has occurred. (The reader is referred to references on shear mapping, e.g., Wikipedia, for information on the matrix math, etc.)
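The action of a 4-element matrix on a pixel location can be illustrated with the following Python sketch. The function names `apply_linear` and `rotation_scale`, and the particular shear factor, angle and scale used, are illustrative assumptions. Note that a 2x2 matrix by itself expresses shear, rotation and scaling about the origin; in this sketch any translation would have to be carried as a separate offset, which is not shown.

```python
import math

def apply_linear(matrix, point):
    """Apply a 2x2 matrix ((a, b), (c, d)) to pixel location (x, y)."""
    (a, b), (c, d) = matrix
    x, y = point
    return (a * x + b * y, c * x + d * y)

def rotation_scale(angle_deg, scale):
    """2x2 matrix for a rotation by angle_deg combined with uniform scaling."""
    t = math.radians(angle_deg)
    return ((scale * math.cos(t), -scale * math.sin(t)),
            (scale * math.sin(t), scale * math.cos(t)))

shear = ((1.0, 0.5), (0.0, 1.0))   # horizontal shear with factor 0.5 (illustrative)
sheared = apply_linear(shear, (10, 20))
rotated = apply_linear(rotation_scale(90, 2.0), (1, 0))
```

Here the shear moves pixel (10, 20) horizontally in proportion to its y-coordinate, and the rotation-plus-scale matrix carries (1, 0) to approximately (0, 2).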

FIG. 58 shows how a cell phone can display affine parameters (e.g., derived from imagery or otherwise). The camera can be placed in this mode through a UI control (e.g., tapping a physical button, making a touchscreen gesture, etc.).

In the depicted arrangement, the device's rotation from (an apparent) horizontal orientation is presented at the top of the cell phone screen. The cell phone processor can make this determination by analyzing the image data for one or more generally parallel elongated straight edge features, averaging them to determine a mean, and assuming that this is the horizon. If the camera is conventionally aligned with the horizon, this mean line will be horizontal. Divergence of this line from horizontal indicates the camera's rotation. This information can be presented textually (e.g., “12 degrees right”), and/or a graphical representation showing divergence from horizontal can be presented.
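The averaging step can be sketched as follows, assuming the elongated straight edge features have already been extracted as line segments by some edge detector (not shown). The function name `rotation_from_edges` and the segment representation are hypothetical. A plain arithmetic mean of angles is used, which is adequate only when the edges are roughly parallel and far from vertical; whether a positive angle should be reported as "right" or "left" depends on the image coordinate convention and is left open.

```python
import math

def rotation_from_edges(segments):
    """Mean angle, in degrees from horizontal, of segments given as ((x1, y1), (x2, y2))."""
    angles = []
    for (x1, y1), (x2, y2) in segments:
        if x2 < x1:
            # Order endpoints left to right so the angle stays within -90..90 degrees.
            x1, y1, x2, y2 = x2, y2, x1, y1
        angles.append(math.degrees(math.atan2(y2 - y1, x2 - x1)))
    return sum(angles) / len(angles)

# Two illustrative segments: one at 45 degrees (endpoints given right to left),
# one exactly horizontal.
estimate = rotation_from_edges([((10, 10), (0, 0)), ((0, 5), (10, 5))])
```

The resulting figure can be formatted for textual display or used to draw the graphical divergence indicator.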

(Other means for sensing angular orientation can be employed. For example, many cell phones include accelerometers, or other tilt detectors, which output data from which the cell phone processor can discern the device's angular orientation.)

In the illustrated embodiment, the camera captures a sequence of image frames (e.g., video) when in this mode of operation. A second datum indicates the angle by which features in the image frame have been rotated since image capture began. Again, this information can be gleaned by analysis of the image data, and can be presented in text form, and/or graphically. (The graphic can comprise a circle, with a line or arrow through the center showing real-time angular movement of the camera to the left or right.)

In similar fashion, the device can track changes in the apparent size of edges, objects, and/or other features in the image, to determine the amount by which scale has changed since image capture started. This indicates whether the camera has moved towards or away from the subject, and by how much. Again, the information can be presented textually and graphically. The graphical presentation can comprise two lines: a reference line, and a second, parallel line whose length changes in real time in accordance with the scale change (larger than the reference line for movement of the camera closer to the subject, and smaller for movement away).

Although not particularly shown in the exemplary embodiment of FIG. 58, other such geometric data can also be derived and presented, e.g., translation, differential scaling, tip angle (i.e., forward/backward), etc.

The determinations detailed above can be simplified if the camera field of view includes a digital watermark having steganographic calibration/orientation data of the sort detailed in the referenced patent documents. However, the information can also be derived from other features in the imagery.

Of course, in still other embodiments, data from one or more accelerometers or other position sensing arrangements in the device, either alone or in conjunction with image data, can be used to generate the presented information.

In addition to presenting such geometric information on the device screen, such information can also be used, e.g., in sensing gestures made with the device by a user, in providing context by which remote system responses can be customized, etc.

Camera-based Environmental and Behavioral State Machine

In accordance with a further aspect of the present technology, a cell phone functions as a state machine, e.g., changing aspects of its functioning based on image-related information previously acquired. The image-related information can be focused on the natural behavior of the camera user, typical environments in which the camera is operated, innate physical characteristics of the camera itself, the structure and dynamic properties of scenes being imaged by the camera, and many other such categories of information. The resulting changes in the camera's function can be directed toward improving image analysis programs resident on a camera-device or remotely located at some image-analysis server. Image analysis is construed very broadly, covering a range of analysis from digital watermark reading, to object and facial recognition, to 2-D and 3-D barcode reading and optical character recognition, all the way through scene categorization analysis and more.

A few simple examples will illustrate what is expected to become an important aspect of future mobile devices.

Consider the problem of object recognition. Most objects have different appearances, depending on the angle from which they are viewed. If a machine vision object-recognition algorithm is given some information about the perspective from which an object is viewed, it can make a more accurate (or faster) guess of what the object is.

People are creatures of habit, including in their use of cell phone cameras. This extends to the hand in which they typically hold the phone, and how they incline it during picture taking. After a user has established a history with a phone, usage patterns may be discerned from the images captured. For example, the user may tend to take photos of subjects not straight-on, but slightly from the right. Such a right-oblique tendency in perspective may be due to the fact that the user routinely holds the camera in the right hand, so exposures are taken from a bit right-of-center.

(Right-obliqueness can be sensed in various ways, e.g., by lengths of vertical parallel edges within image frames. If edges tend to be longer on the right sides of the images, this tends to indicate that the images were taken from a right-oblique view. Differences in illumination across foreground subjects can also be used; brighter illumination on the right side of subjects suggests the right side was closer to the lens. Etc.)

Similarly, in order to comfortably operate the shutter button of the phone while holding the device, this particular user may habitually adopt a grip of the phone that inclines the top of the camera five degrees towards the user (i.e., to the left). This results in the captured image subjects generally being skewed with an apparent rotation of five degrees.

Such recurring biases can be discerned by examining a collection of images captured by that user with that cell phone. Once identified, data memorializing these idiosyncrasies can be stored in a memory, and used to optimize image recognition processes performed by the device.

Thus, the device may generate a first output (e.g., a tentative object identification) from a given image frame at one time, but generate a second, different output (e.g., a different object identification) from the same image frame at a later time, due to intervening use of the camera.

A characteristic pattern of the user's hand jitter may also be inferred by examination of plural images. For example, by examining pictures of different exposure periods, it may be found that the user has a jitter with a frequency of about 4 Hertz, which is predominantly in the left-right (horizontal) direction. Sharpening filters tailored to that jitter behavior (and also dependent on the length of the exposure) can then be applied to enhance the resulting imagery.

In similar fashion, through use, the device may notice that the images captured by the user during weekday hours of 9:00-5:00 are routinely illuminated with a spectrum characteristic of fluorescent lighting, to which a rather extreme white-balancing operation needs to be applied to try and compensate. With a priori knowledge of this tendency, the device can expose photos captured during those hours differently than with its baseline exposure parameters, anticipating the fluorescent illumination, and allowing a better white balance to be achieved.

Over time the device derives information that models some aspect of the user's customary behavior or environmental variables. The device then adapts some aspect of its operation accordingly.

The device may also adapt to its own peculiarities or degradations. These include non-uniformities in the photodiodes of the image sensor, dust on the image sensor, mars on the lens, etc.

Again, over time, the device may detect a recurring pattern, e.g.: (a) that one pixel gives a 2% lower average output signal than adjoining pixels; (b) that a contiguous group of pixels tends to output signals that are about 3 digital numbers lower than averages would otherwise indicate; (c) that a certain region of the photosensor does not seem to capture high frequency detail, so that imagery in that region is consistently a bit blurry, etc. From such recurring phenomena, the device can deduce, e.g., that (a) the gain for the amplifier serving this pixel is low; (b) dust or other foreign object is occluding these pixels; and (c) a lens flaw prevents light falling in this region of the photosensor from being properly focused, etc. Appropriate compensations can then be applied to mitigate these shortcomings.
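Case (a) above, a single pixel with consistently low output, can be sketched as follows. This Python illustration averages each pixel over a history of frames and compares it with the mean of its adjoining pixels. The function name `low_gain_pixels`, the 1.5% threshold, and the representation of frames as lists of rows are assumptions; in practice the history would need to span many varied scenes so that scene content averages out.

```python
def low_gain_pixels(frames, threshold=0.015):
    """Return {(x, y): relative deficit} for pixels whose long-run mean is at least
    `threshold` below the mean of their adjoining pixels."""
    h, w, n = len(frames[0]), len(frames[0][0]), len(frames)
    # Long-run mean output of every pixel across the frame history.
    mean = [[sum(f[y][x] for f in frames) / n for x in range(w)] for y in range(h)]
    flagged = {}
    for y in range(h):
        for x in range(w):
            neighbors = [
                mean[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if (i, j) != (x, y)
            ]
            ref = sum(neighbors) / len(neighbors)
            if ref > 0:
                deficit = (ref - mean[y][x]) / ref
                if deficit >= threshold:
                    flagged[(x, y)] = deficit
    return flagged

# Illustrative history: uniform frames in which the center pixel reads 2% low.
frame = [[100, 100, 100], [100, 98, 100], [100, 100, 100]]
flagged = low_gain_pixels([frame] * 5)
```

A compensating gain of 1 / (1 - deficit) could then be applied to each flagged pixel; that correction rule is likewise an assumption of this sketch.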

Common aspects of the subject-matter or “scenes being imaged” are another rich source of information for subsequent image analysis routines, or at least early-stage image processing steps which assist later stage image analysis routines by optimally filtering and/or transforming the pixel data. For example, it may become clear over days and weeks of camera usage that a given user only uses their camera for three basic interests: digital watermark reading, barcode reading, and visual logging of experimental set-ups in a laboratory. A histogram can be developed over time showing which “end result” operation some given camera usage led toward, followed by an increase in processing cycles devoted to early detections of both watermark and barcode basic characteristics. Drilling a bit deeper here, a Fourier-transformed set of image data may be preferentially routed to a quick 2-D barcode detection function which may otherwise have been de-prioritized. Likewise on digital watermark reading, where Fourier transformed data may be shipped to a specialized pattern recognition routine. A partially abstract way to view this state-machine change is that there is only a fixed amount of CPU and image-processing cycles available to a camera device, and choices need to be made on which modes of analysis get what portions of those cycles.
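One simple way to turn such a histogram into an allocation of the fixed cycle budget is a proportional split, sketched below in Python. The function name `allocate_cycles`, the operation labels, the usage counts, and the proportional rule itself are assumptions; other policies (e.g., a guaranteed minimum share for every mode) are equally possible.

```python
from collections import Counter

def allocate_cycles(usage_log, budget):
    """Split a fixed cycle budget across analysis modes in proportion to how often
    each mode was the end result of past camera usage."""
    histogram = Counter(usage_log)
    total = sum(histogram.values())
    return {op: budget * count / total for op, count in histogram.items()}

# Illustrative history of "end result" operations (counts are made up for the example).
usage_log = ["watermark"] * 5 + ["barcode"] * 3 + ["lab_log"] * 2
shares = allocate_cycles(usage_log, budget=1000)
```

As the usage log grows, the shares shift, so that the same image frame may be processed differently at a later time.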

An over-simplified representation of such embodiments is shown in FIG. 59.

By arrangements such as just-discussed, operation of an imager-equipped device evolves through its continued operation.

Focus Issues, Enhanced Print-to-Web Linking Based on Page Layout

Cameras currently provided with most cell phones, and other portable PDA-like devices, do not generally have adjustable focus. Rather, the optics are arranged in compromise fashion, aiming to get a decent image under typical portrait snapshot and landscape circumstances. Imaging at close distances generally yields inferior results, losing high frequency detail. (This is ameliorated by just-discussed “extended depth of field” image sensors, but widespread deployment of such devices has not yet occurred.)

The human visual system has different sensitivity to imagery at different spectral frequencies. Different image frequencies convey different impressions. Low frequencies give global information about an image, such as its orientation and general shape. High frequencies give fine details and edges. As shown in FIG. 72, the sensitivity of the human vision system peaks at frequencies of about 10 cycles/mm on the retina, and falls away steeply on either side. (Perception also depends on contrast between features sought to be distinguished, shown on the vertical axis.) Image features with spatial frequencies and contrast in the cross-hatched zone are usually not perceivable by humans. FIG. 73 shows an image with the low and high frequencies depicted separately (on the left and right).

Digital watermarking of print media, such as newspapers, can be effected by tinting the page (before, during or after printing) with an inoffensive background pattern that steganographically conveys auxiliary payload data. Different columns of text can be encoded with different payload data, e.g., permitting each news story to link to a different electronic resource (see, e.g., Digimarc's U.S. Pat. Nos. 6,985,600, 6,947,571 and 6,724,912).

In accordance with another aspect of the present technology, the close-focus shortcoming of portable imaging devices is overcome by embedding a lower frequency digital watermark (e.g., with a spectral composition centered on the left side of FIG. 72, above the curve). Instead of encoding different watermarks in different columns, the page is marked with a single watermark that spans the page, encoding an identifier for that page.

When a user snaps a picture of a newspaper story of interest (which picture may capture just text/graphics from the desired story/advertisement, or may span other content as well), the watermark of that page is decoded (either locally by the device, remotely by a different device, or in distributed fashion).

The decoded watermark serves to index a data structure that returns information to the device, to be presented on its display screen. The display presents a map of the newspaper page layout, with different articles/advertisements shown in different colors.

FIGS. 74 and 75 illustrate one particular embodiment. The original page is shown in FIG. 74. The layout map displayed on the user device screen is shown in FIG. 75.

To link to additional information about any of the stories, the user simply touches the portion of the displayed map corresponding to the story of interest. (If the device is not equipped with a touch screen, the map of FIG. 75 can be presented with indicia identifying the different map zones, e.g., 1, 2, 3 . . . or A, B, C . . . . The user can then operate the device's numeric or alphanumeric user interface (e.g., keypad) to identify the article of interest.)

The user's selection is transmitted to a remote server (which may be the same one that served the layout map data to the portable device, or another one), which then consults stored data to identify information responsive to the user's selection. For example, if the user touches the region in the lower right of the page map, the remote system may instruct a server at buick-dot-com to transmit a page for presentation on the user device, with more information about the Buick Lucerne. Or the remote system can send the user device a link to that page, and the device can then load the page. Or the remote system can cause the user device to present a menu of options, e.g., for a news article the user may be given options to: listen to a related podcast; see earlier stories on the same topic; order reprints; download the article as a Word file, etc. Or the remote system can send the user a link to a web page or menu page by email, so that the user can review same at a later time. (A variety of such different responses to user-expressed selections can be provided, as are known from the art cited herein.)

Instead of the map of FIG. 75, the system may cause the user device to display a screen showing a reduced scale version of the newspaper page itself, like that shown in FIG. 74. Again, the user can simply touch the article of interest to trigger an associated response.

Or instead of presenting a graphical layout of the page, the remote system can return titles of all the content on the page (e.g., “Banks Owe Billions . . . ”, “McCain Pins Hopes . . . ”, “Buick Lucerne”). These titles are presented in menu form on the device screen, and the user touches the desired item (or enters a corresponding number/letter selection).

The layout map for each printed newspaper and magazine page is typically generated by the publishing company as part of its layout process, e.g., using automated software from vendors such as Quark, Impress and Adobe, etc. Existing software thus knows what articles and advertisements appear in what spaces on each printed page. These same software tools, or others, can be adapted to take this layout map information, associate corresponding links or other data for each story/advertisement, and store the resulting data structure in a web-accessible server from which portable devices can access same.

The layout of newspaper and magazine pages offers orientation information that can be useful in watermark decoding. Columns are vertical. Headlines and lines of text are horizontal. Even at very low spatial image frequencies, such shape orientation can be distinguished. A user capturing an image of a printed page may not capture the content “squarely.” However, these strong vertical and horizontal components of the image are readily determined by algorithmic analysis of the captured image data, and allow the rotation of the captured image to be discerned. This knowledge simplifies and speeds the watermark decoding process (since a first step in many watermark decoding operations is to discern the rotation of the image from its originally-encoded state).
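One possible form of such algorithmic analysis is sketched below. It is an illustration under stated assumptions, not a method taken from this specification: it presumes that edge orientations (in degrees) have already been measured by some upstream edge or gradient detector, and it folds them onto a 90 degree period before averaging, so that horizontal and vertical edges reinforce one another. The function name is hypothetical.

```python
import math


def estimate_page_rotation(edge_angles_deg, weights=None):
    """Estimate image rotation from edge orientations, assuming the page's
    dominant edges are horizontal or vertical.

    Angles are folded to a 90-degree period by quadrupling them before a
    circular mean is taken. The result lies in (-45, 45] degrees; rotation
    by a further multiple of 90 degrees cannot be resolved this way.
    """
    if weights is None:
        weights = [1.0] * len(edge_angles_deg)
    s = sum(w * math.sin(math.radians(4.0 * a))
            for a, w in zip(edge_angles_deg, weights))
    c = sum(w * math.cos(math.radians(4.0 * a))
            for a, w in zip(edge_angles_deg, weights))
    return math.degrees(math.atan2(s, c)) / 4.0


# Edges of a page photographed about 7 degrees off square:
# text lines near 7 degrees, column edges near 97 degrees.
rotation = estimate_page_rotation([7.0, 97.0, 187.0, 6.5, 97.5, 8.0, 96.0])
```

The optional weights could carry edge strength, so that long column rules and headlines dominate the estimate.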

In another embodiment, delivery of a page map to the user device from a remote server is not required. Again, a region of a page spanning several items of content is encoded with a single watermark payload. Again, the user captures an image including content of interest. The watermark identifying the page is decoded.

In this embodiment, the captured image is displayed on the device screen, and the user touches the content region of particular interest. The coordinates of the user's selection within the captured image data are recorded.

FIG. 76 is illustrative. The user has used an Apple iPhone, a T-Mobile Android phone, or the like to capture an image of an excerpt from a watermarked newspaper page, and then touches an article of interest (indicated by the oval). The location of the touch within the image frame is known to the touch screen software, e.g., as an offset from the upper left corner, measured in pixels. (The display may have a resolution of 480×320 pixels.) The touch may be at pixel position (200,160).

The watermark spans the page, and is shown in FIG. 76 by the dashed diagonal lines. The watermark (e.g., as described in Digimarc's U.S. Pat. No. 6,590,996) has an origin, but the origin point is not within the image frame captured by the user. However, from the watermark, the watermark decoder software knows the scale of the image and its rotation. It also knows the offset of the captured image frame from the watermark's origin. Based on this information, and information about the scale at which the original watermark was encoded (which information can be conveyed with the watermark, accessed from a remote repository, hard-coded into the detector, etc.), the software can determine that the upper left corner of the captured image frame corresponds to a point 1.6 inches below, and 2.3 inches to the right, of the top left corner of the originally printed page (assuming the watermark origin is at the top left corner of the page). From the decoded scale information, the software can discern that the 480 pixel width of the captured image corresponds to an area of the originally printed page 12 inches in width.

The software finally determines the position of the user's touch, as an offset from the upper left corner of the originally-printed page. It knows the corner of the captured image is offset 1.6″ down and 2.3″ to the right from the upper left corner of the printed page, and that the touch is a further 5″ to the right (200 pixels×12″/480 pixels) and a further 4″ down (160 pixels×12″/480 pixels), for a final position within the originally-printed page of 5.6″ down and 7.3″ to the right.
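The arithmetic can be worked through in a short sketch. The function and parameter names are hypothetical, and the sketch assumes that rotation has already been corrected and that pixels are square; it keeps the rightward and downward components separate, using the frame offset stated above (2.3 inches right, 1.6 inches down).

```python
def touch_to_page_position(touch_px, frame_width_px, frame_width_in,
                           frame_origin_in):
    """Map a touch in captured-frame pixels to inches on the printed page.

    touch_px        -- (x, y) pixel offset from the frame's upper left corner
    frame_width_px  -- width of the captured frame, in pixels
    frame_width_in  -- width of the page area spanned by the frame, in inches
    frame_origin_in -- (right, down) offset of the frame's upper left corner
                       from the page's upper left corner, in inches
    Assumes rotation is already corrected and pixels are square.
    """
    inches_per_pixel = frame_width_in / frame_width_px
    right = frame_origin_in[0] + touch_px[0] * inches_per_pixel
    down = frame_origin_in[1] + touch_px[1] * inches_per_pixel
    return right, down


right, down = touch_to_page_position((200, 160), 480, 12.0, (2.3, 1.6))
print(f"{right:.1f} in right, {down:.1f} in down")  # 7.3 in right, 5.6 in down
```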

The device then sends these coordinates to the remote server, together with the payload of the watermark (identifying the page). The server looks up the layout map of the identified page (from an appropriate database in which it was stored by the page layout software) and, by reference to the coordinates, determines in which of the articles/advertisements the user's touch fell. The remote system then returns to the user device responsive information related to the indicated article, as noted above.
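A minimal sketch of the server-side lookup follows. The rectangular zone representation, the zone coordinates and the identifiers are all hypothetical; real layout maps produced by page layout software may use arbitrary polygons and a different storage schema.

```python
def find_article(layout_map, right_in, down_in):
    """Return the ID of the layout zone containing a page position, or None.

    Each zone is (article_id, left, top, width, height) in inches, a
    hypothetical rectangular simplification of a real page layout.
    """
    for article_id, left, top, width, height in layout_map:
        if left <= right_in < left + width and top <= down_in < top + height:
            return article_id
    return None


# Hypothetical layout map for the page identified by the watermark payload.
layout_map = [
    ("banks-owe-billions", 0.5, 1.0, 6.0, 8.0),
    ("mccain-pins-hopes", 6.5, 1.0, 5.0, 8.0),
    ("buick-lucerne-ad", 6.5, 9.0, 5.0, 6.0),
]

selected = find_article(layout_map, right_in=7.3, down_in=5.6)
```

The returned identifier would then key whatever links or menus were associated with that story or advertisement.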

Returning to focus, the close-focus handicap of PDA cameras can actually be turned to advantage in decoding watermarks. No watermark information is retrieved from inked areas of text. The subtle modulations of luminance on which most watermarks are based are lost in regions that are printed full-black.

If the page substrate is tinted with a watermark, the useful watermark information is recovered from those regions of the page that are unprinted, e.g., from “white space” between columns, between lines, at the end of paragraphs, etc. The inked characters are “noise” that is best ignored. The blurring of printed portions of the page introduced by focus deficiencies of PDA cameras can be used to define a mask identifying areas that are heavily inked. Those portions may be disregarded when decoding watermark data.

More particularly, the blurred image data can be thresholded. Any image pixels having a value darker than a threshold value can be ignored. Put another way, only image pixels having a value lighter than a threshold are input to the watermark decoder. The “noise” contributed by the inked characters is thus filtered out.
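The thresholding step can be sketched as follows. The threshold of 128 on a 0-255 luminance scale is an assumed value chosen only for illustration, and the function name is hypothetical; a practical threshold would depend on paper, ink and exposure.

```python
def mask_inked_pixels(blurred, threshold=128):
    """Separate light (unprinted) pixels from dark (inked) pixels.

    blurred is a blurred grayscale image given as rows of 0-255 luminance
    values. Returns (kept, mask): kept lists the pixel values lighter than
    the threshold, and mask[y][x] is True where a pixel may be passed to
    the watermark decoder.
    """
    mask = [[pixel > threshold for pixel in row] for row in blurred]
    kept = [pixel for row in blurred for pixel in row if pixel > threshold]
    return kept, mask


# A tiny blurred patch: light paper with two heavily inked pixels.
patch = [[250, 240, 30],
         [245, 20, 235]]
kept, mask = mask_inked_pixels(patch)
```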

In imaging devices that capture sharply-focused text, a similar advantage may be produced by processing the text with a blurring kernel, and subtracting out those regions that are thus found to be dominated by printed text.

By arrangements such as detailed by the foregoing, deficiencies of portable imaging devices are redressed, and enhanced print-to-web linking based on page layout data is enabled.

Image Search, Feature Extraction, Pattern Matching, Etc.

Image search functionality in certain of the foregoing embodiments can be implemented using Pixsimilar image search software and/or the Visual Search Developer's Kit (SDK), both from Idée, Inc. (Toronto, ON). A tool for automatically generating descriptive annotations for imagery is ALIPR (Automatic Linguistic Indexing of Pictures), as detailed in U.S. Pat. No. 7,394,947 (Penn State).

Content-based image retrieval (CBIR) can also be used in the foregoing embodiments. As is familiar to artisans, CBIR essentially involves (1) abstracting a characterization of an image, usually mathematically; and (2) using such characterizations to assess similarity between images. Two papers surveying these fields are Smeulders et al, “Content-Based Image Retrieval at the End of the Early Years,” IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22, No. 12, pp. 1349-1380, 2000, and Datta et al, “Image Retrieval: Ideas, Influences and Trends of the New Age,” ACM Computing Surveys, Vol. 40, No. 2, April 2008.

The task of identifying like-appearing imagery from large image databases is a familiar operation in the issuance of driver's licenses. That is, an image captured from a new applicant is commonly checked against a database of all previous driver's license photos, to check whether the applicant has already been issued a driver's license (possibly under another name). Methods and systems known from the driver's license field can be employed in the arrangements detailed herein. (Examples include Identix U.S. Pat. No. 7,369,685 and L-1 Corp. U.S. Pat. Nos. 7,283,649 and 7,130,454.)

Useful in many of the embodiments herein are image feature extraction algorithms known as CEDD and FCTH. The former is detailed in Chatzichristofis et al, “CEDD: Color and Edge Directivity Descriptor - A Compact Descriptor for Image Indexing and Retrieval,” 6th International Conference in advanced research on Computer Vision Systems ICVS 2008, May, 2008; the latter is detailed in Chatzichristofis et al, “FCTH: Fuzzy Color And Texture Histogram - A Low Level Feature for Accurate Image Retrieval,” 9th International Workshop on Image Analysis for Multimedia Interactive Services, Proceedings: IEEE Computer Society, May, 2008.

Open-source software implementing these techniques is available; see the web page savvash.blogspot-dot-com/2008/05/cedd-and-fcth-are-now-open-dot-html. DLLs implementing their functionality can be downloaded; the classes can be invoked on input image data (e.g., file.jpg) as follows:

double[] CEDDTable = new double[144];
double[] FCTHTable = new double[144];
Bitmap ImageData = new Bitmap("c:/file.jpg");
CEDD GetCEDD = new CEDD();
FCTH GetFCTH = new FCTH();
CEDDTable = GetCEDD.Apply(ImageData);
FCTHTable = GetFCTH.Apply(ImageData, 2);

CEDD and FCTH can be combined, to yield improved results, using the Joint Composite Descriptor file available from the just-cited web page.

Chatzichristofis has made available an open source program “img(Finder)” (see the web page savvash.blogspot-dot-com/2008/07/image-retrieval-in-facebook-dot-html), a content based image retrieval desktop application that retrieves and indexes images from the FaceBook social networking site, using CEDD and FCTH. In use, a user connects to FaceBook with their personal account data, and the application downloads information from the images of the user, as well as the user's friends' image albums, to index these images for retrieval with the CEDD and FCTH features. The index can thereafter be queried by a sample image.

Chatzichristofis has also made available an online search service “img(Anaktisi)” to which a user uploads a photo, and the service searches one of 11 different image archives for similar images, using image metrics including CEDD and FCTH. See orpheus.ee.duth-dot-gr/anaktisi/. (The image archives include Flickr.) In the associated commentary to the Anaktisi search service, Chatzichristofis explains:

- The rapid growth of digital images through the widespread popularization of computers and the Internet makes the development of an efficient image retrieval technique imperative. Content-based image retrieval, known as CBIR, extracts several features that describe the content of the image, mapping the visual content of the images into a new space called the feature space. The feature space values for a given image are stored in a descriptor that can be used for retrieving similar images. The key to a successful retrieval system is to choose the right features that represent the images as accurately and uniquely as possible. The features chosen have to be discriminative and sufficient in describing the objects present in the image. To achieve these goals, CBIR systems use three basic types of features: color features, texture features and shape features. It is very difficult to achieve satisfactory retrieval results using only one of these feature types.

- To date, many proposed retrieval techniques adopt methods in which more than one feature type is involved. For instance, color, texture and shape features are used in both IBM's QBIC and MIT's Photobook. QBIC uses color histograms, a moment-based shape feature, and a texture descriptor. Photobook uses appearance features, texture features, and 2D shape features. Other CBIR systems include SIMBA, CIRES, SIMPLIcity, IRMA, FIRE and MIRROR. A cumulative body of research presents extraction methods for these feature types.

- In most retrieval systems that combine two or more feature types, such as color and texture, independent vectors are used to describe each kind of information. It is possible to achieve very good retrieval scores by increasing the size of the descriptors of images that have a high dimensional vector, but this technique has several drawbacks. If the descriptor has hundreds or even thousands of bins, it may be of no practical use because the retrieval procedure is significantly delayed. Also, increasing the size of the descriptor increases the storage requirements which may have a significant penalty for databases that contain millions of images. Many presented methods limit the length of the descriptor to a small number of bins, leaving the possible factor values in decimal, non-quantized, form.

- The Moving Picture Experts Group (MPEG) defines a standard for content-based access to multimedia data in their MPEG-7 standard. This standard identifies a set of image descriptors that maintain a balance between the size of the feature and the quality of the retrieval results.

- In this web-site a new set of feature descriptors is presented in a retrieval system. These descriptors have been designed with particular attention to their size and storage requirements, keeping them as small as possible without compromising their discriminating ability. These descriptors incorporate color and texture information into one histogram while keeping their sizes between 23 and 74 bytes per image.

- High retrieval scores in content-based image retrieval systems can be attained by adopting relevance feedback mechanisms. These mechanisms require the user to grade the quality of the query results by marking the retrieved images as being either relevant or not. Then, the search engine uses this grading information in subsequent queries to better satisfy users' needs. It is noted that while relevance feedback mechanisms were first introduced in the information retrieval field, they currently receive considerable attention in the CBIR field. The vast majority of relevance feedback techniques proposed in the literature are based on modifying the values of the search parameters so that they better represent the concept the user has in mind. Search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. For instance, relevance feedback is frequently formulated in terms of the modification of the query vector and/or in terms of adaptive similarity metrics.

- Also, in this web-site an Auto Relevance Feedback (ARF) technique is introduced which is based on the proposed descriptors. The goal of the proposed Automatic Relevance Feedback (ARF) algorithm is to optimally readjust the initial retrieval results based on user preferences. During this procedure the user selects from the first round of retrieved images one as being relevant to his/her initial retrieval expectations. Information from these selected images is used to alter the initial query image descriptor.

Another open source Content Based Image Retrieval system is GIFT (GNU Image Finding Tool), produced by researchers at the University of Geneva. One of the tools allows users to index directory trees containing images. The GIFT server and its client (SnakeCharmer) can then be used to search the indexed images based on image similarity. The system is further described at the web page gnu-dot-org/software/gift/gift-dot-html. The latest version of the software can be found at the ftp server ftp.gnu-dot-org/gnu/gift.

Still another open source CBIR system is Fire, written by Tom Deselaers and others at RWTH Aachen University, available for download from the web page i6.informatik.rwth-aachen-dot-de/~deselaers/fire/. Fire makes use of technology described, e.g., in Deselaers et al, “Features for Image Retrieval: An Experimental Comparison”, Information Retrieval, Vol. 11, No. 2, The Netherlands, Springer, pp. 77-107, March, 2008.

Embodiments of the present invention are generally concerned with objects depicted in imagery, rather than full frames of image pixels. Recognition of objects within imagery (sometimes termed computer vision) is a large science with which the reader is presumed to be familiar. Edges and centroids are among the image features that can be used to aid in recognizing objects in images. Shape contexts are another (c.f., Belongie et al, Matching with Shape Contexts, IEEE Workshop on Content Based Access of Image and Video Libraries, 2000.) Robustness to affine transformations (e.g., scale invariance, rotation invariance) is an advantageous feature of certain object recognition/pattern matching/computer vision techniques. Methods based on the Hough transform, and the Fourier-Mellin transform, exhibit rotation-invariant properties. SIFT (discussed below) is an image recognition technique with this and other advantageous properties.

In addition to object recognition/computer vision, the processing of imagery contemplated in this specification (as opposed to the processing of associated metadata) can make use of various other techniques, which can go by various names. Included are image analysis, pattern recognition, feature extraction, feature detection, template matching, facial recognition, eigenvectors, etc. (All these terms are generally used interchangeably in this specification.) The interested reader is referred to Wikipedia, which has an article on each of the just-listed topics, including a tutorial and citations to related information. Excerpts from circa September, 2008 versions of these Wikipedia articles are appended to the end of the provisional specification to which this application claims priority.

Image metrics of the sort discussed are sometimes regarded as metadata, namely “content-dependent metadata.” This is in contrast to “content-descriptive metadata,” which is the more familiar sense in which the term metadata is used.

SIFT

Reference is sometimes made to SIFT techniques. SIFT is an acronym for Scale-Invariant Feature Transform, a computer vision technology pioneered by David Lowe and described in various of his papers including “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, 60, 2 (2004), pp. 91-110; and “Object Recognition from Local Scale-Invariant Features,” International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157, as well as in U.S. Pat. No. 6,711,293.

SIFT works by identification and description, and subsequent detection, of local image features. The SIFT features are local and based on the appearance of the object at particular interest points, and are invariant to image scale, rotation and affine transformation. They are also robust to changes in illumination, noise, and some changes in viewpoint. In addition to these properties, they are distinctive, relatively easy to extract, allow for correct object identification with low probability of mismatch and are straightforward to match against a (large) database of local features. Object description by a set of SIFT features is also robust to partial occlusion; as few as three SIFT features from an object are enough to compute its location and pose.

The technique starts by identifying local image features, termed keypoints, in a reference image. This is done by convolving the image with Gaussian blur filters at different scales (resolutions), and determining differences between successive Gaussian-blurred images. Keypoints are those image features having maxima or minima of the difference of Gaussians occurring at multiple scales. (Each pixel in a difference-of-Gaussian frame is compared to its eight neighbors at the same scale, and to the nine corresponding pixels in each of the two neighboring scales. If the pixel value is a maximum or minimum among all these pixels, it is selected as a candidate keypoint.)
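The neighbor comparison can be sketched as below. This follows the 26-neighbor comparison described in Lowe's 2004 paper (8 neighbors at the same scale plus 9 in each of the two adjacent scales); the function name and the nested-list data layout are assumptions made for illustration, and computation of the difference-of-Gaussian frames themselves is not shown.

```python
def is_scale_space_extremum(dog, s, y, x):
    """Return True if dog[s][y][x] is strictly greater than, or strictly
    smaller than, all 26 of its neighbors: 8 at the same scale and 9 in
    each of the two adjacent scales.

    dog is a list of difference-of-Gaussian frames, each a list of rows.
    The point must not lie on the border of the stack or of a frame.
    """
    center = dog[s][y][x]
    neighbors = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return center > max(neighbors) or center < min(neighbors)


# Three 3x3 frames of zeros with a single peak in the middle frame.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
candidate = is_scale_space_extremum(stack, 1, 1, 1)
```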

(It will be recognized that the just-described procedure is a blob-detection method that detects space-scale extrema of a scale-localized Laplacian transform of the image. The difference of Gaussians approach is an approximation of such Laplacian operation, expressed in a pyramid setting.)

The above procedure typically identifies many keypoints that are unsuitable, e.g., due to having low contrast (thus being susceptible to noise), or due to having poorly determined locations along an edge (the Difference of Gaussians function has a strong response along edges, yielding many candidate keypoints, but many of these are not robust to noise). These unreliable keypoints are screened out by performing a detailed fit on the candidate keypoints to nearby data for accurate location, scale, and ratio of principal curvatures. This rejects keypoints that have low contrast, or are poorly located along an edge.

More particularly, this process starts by, for each candidate keypoint, interpolating nearby data to more accurately determine keypoint location. This is often done by a Taylor expansion with the keypoint as the origin, to determine a refined estimate of maxima/minima location.

The value of the second-order Taylor expansion can also be used to identify low contrast keypoints. If the contrast is less than a threshold (e.g., 0.03), the keypoint is discarded.

To eliminate keypoints having strong edge responses but that are poorly localized, a variant of a corner detection procedure is applied. Briefly, this involves computing the principal curvature across the edge, and comparing to the principal curvature along the edge. This is done by solving for eigenvalues of a second order Hessian matrix.
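The two screening tests can be sketched together. The 0.03 contrast threshold is the example value given above; the maximum curvature ratio of 10 is the value used in Lowe's 2004 paper and is an assumption here, as are the function name and the choice to take the second derivatives as ready-made inputs (their estimation from pixel differences is not shown).

```python
import math


def is_stable_keypoint(value, dxx, dyy, dxy,
                       contrast_threshold=0.03, max_ratio=10.0):
    """Screen a candidate keypoint for low contrast and edge response.

    value         -- interpolated difference-of-Gaussian value at the keypoint
    dxx, dyy, dxy -- second derivatives forming the 2x2 Hessian
                     [[dxx, dxy], [dxy, dyy]] at the keypoint
    max_ratio     -- largest allowed ratio of principal curvatures
                     (10 is the value from Lowe's paper, assumed here)
    """
    if abs(value) < contrast_threshold:
        return False                      # low contrast: sensitive to noise
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        return False                      # curvatures of opposite sign
    # Eigenvalue magnitudes of the symmetric 2x2 Hessian.
    disc = math.sqrt(max((trace / 2.0) ** 2 - det, 0.0))
    larger = abs(trace) / 2.0 + disc
    smaller = abs(trace) / 2.0 - disc
    # Reject when curvature across the edge dwarfs curvature along it.
    return larger <= max_ratio * smaller
```

A blob-like keypoint has two comparable curvatures and passes; a point on a straight edge has one large and one small curvature and is rejected.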

Once unsuitable keypoints are discarded, those that remain are assessed for orientation, by a local image gradient function. Magnitude and direction of the gradient are calculated for every pixel in a neighboring region around a keypoint in the Gaussian blurred image (at that keypoint's scale). An orientation histogram with 36 bins is then compiled, with each bin encompassing ten degrees of orientation. Each pixel in the neighborhood contributes to the histogram, with the contribution weighted by its gradient's magnitude and by a Gaussian with σ equal to 1.5 times the scale of the keypoint. The peaks in this histogram define the keypoint's dominant orientation. This orientation data allows SIFT to achieve rotation robustness, since the keypoint descriptor can be represented relative to this orientation.
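A simplified sketch of the orientation histogram follows. The function names and the sample format are hypothetical; gradient magnitudes and angles are assumed to have been computed already, and refinements such as peak interpolation and secondary peaks are omitted.

```python
import math


def orientation_histogram(samples, keypoint_scale, num_bins=36):
    """Compile a gradient-orientation histogram around a keypoint.

    samples is an iterable of (dx, dy, magnitude, angle_deg), where dx, dy
    are pixel offsets from the keypoint. Each sample is weighted by its
    gradient magnitude and by a Gaussian with sigma = 1.5 * keypoint_scale.
    """
    sigma = 1.5 * keypoint_scale
    bin_width = 360.0 / num_bins
    hist = [0.0] * num_bins
    for dx, dy, magnitude, angle_deg in samples:
        gauss = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
        index = int((angle_deg % 360.0) / bin_width) % num_bins
        hist[index] += magnitude * gauss
    return hist


def dominant_orientation(hist):
    """Return the center angle, in degrees, of the strongest bin."""
    bin_width = 360.0 / len(hist)
    peak = max(range(len(hist)), key=hist.__getitem__)
    return (peak + 0.5) * bin_width


samples = [(0, 0, 1.0, 95.0), (1, 0, 2.0, 92.0), (0, 1, 0.5, 200.0)]
hist = orientation_histogram(samples, keypoint_scale=1.0)
angle = dominant_orientation(hist)
```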

From the foregoing, plural keypoints at different scales are identified, each with corresponding orientations. This data is invariant to image translation, scale and rotation. 128-element descriptors are then generated for each keypoint, allowing robustness to illumination and 3D viewpoint.

This operation is similar to the orientation assessment procedure just reviewed. The keypoint descriptor is computed as a set of orientation histograms on (4×4) pixel neighborhoods. The orientation histograms are relative to the keypoint orientation and the orientation data comes from the Gaussian image closest in scale to the keypoint's scale. As before, the contribution of each pixel is weighted by the gradient magnitude, and by a Gaussian with σ equal to 1.5 times the scale of the keypoint. Histograms contain 8 bins each, and each descriptor contains a 4×4 array of 16 histograms around the keypoint. This leads to a SIFT feature vector with 4×4×8=128 elements. This vector is normalized to enhance invariance to changes in illumination.
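The 4×4×8 layout can be made concrete with a deliberately simplified sketch. It assumes a 16×16 patch of precomputed gradient magnitudes and angles (already expressed relative to the keypoint orientation), and it omits the Gaussian weighting and the interpolation between bins that a full implementation would apply. The function name is hypothetical.

```python
import math


def sift_like_descriptor(magnitudes, angles_deg):
    """Build a 128-element descriptor from a 16x16 gradient patch.

    magnitudes and angles_deg are 16x16 lists of rows. The patch is divided
    into a 4x4 array of cells, each 4x4 pixels, and each cell accumulates an
    8-bin orientation histogram (45 degrees per bin). The result is
    normalized to unit length. Gaussian weighting is omitted for brevity.
    """
    descriptor = [0.0] * 128
    for y in range(16):
        for x in range(16):
            cell = (y // 4) * 4 + (x // 4)                 # 0..15
            obin = int((angles_deg[y][x] % 360.0) / 45.0) % 8
            descriptor[cell * 8 + obin] += magnitudes[y][x]
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm > 0.0:
        descriptor = [v / norm for v in descriptor]
    return descriptor


# Uniform patch: every gradient has magnitude 1 at 10 degrees.
mags = [[1.0] * 16 for _ in range(16)]
angs = [[10.0] * 16 for _ in range(16)]
desc = sift_like_descriptor(mags, angs)
```

Normalizing to unit length is what makes the vector insensitive to a uniform scaling of gradient magnitudes, such as a change in image contrast.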

The foregoing procedure is applied to training images to compile a reference database. An unknown image is then processed as above to generate keypoint data, and the closest-matching image in the database is identified by a Euclidean distance-like measure. (A “best-bin-first” algorithm is typically used instead of a pure Euclidean distance calculation, to achieve several orders of magnitude speed improvement.) To avoid false positives, a “no match” output is produced if the distance score for the best match is close (e.g., within 25%) to the distance score for the next-best match.
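A brute-force version of this matching rule is sketched below. It uses a plain Euclidean distance scan rather than the best-bin-first search mentioned above, and it reads the 25% figure as "reject when the best distance exceeds 75% of the next-best distance," which is one interpretation and is labeled as an assumption. The function name and the (label, descriptor) database format are hypothetical.

```python
import math


def match_descriptor(query, reference_db, margin=0.25):
    """Return the label of the closest reference descriptor, or None.

    reference_db is a list of (label, descriptor) pairs. A result of None
    ("no match") is produced when the best distance is not at least
    `margin` (assumed 25%) smaller than the next-best distance.
    Brute-force search; best-bin-first indexing is not shown.
    """
    scored = sorted((math.dist(query, desc), label)
                    for label, desc in reference_db)
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] > (1.0 - margin) * scored[1][0]:
        return None                       # ambiguous: two candidates too close
    return scored[0][1]


reference_db = [("a", [0.0, 0.0]), ("b", [10.0, 0.0])]
clear = match_descriptor([4.0, 0.0], reference_db)      # distances 4 and 6
ambiguous = match_descriptor([4.8, 0.0], reference_db)  # distances 4.8 and 5.2
```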

To further improve performance, an image may be matched by clustering. This identifies features that belong to the same reference image, allowing unclustered results to be discarded as spurious. A Hough transform can be used, identifying clusters of features that vote for the same object pose.

An article detailing a particular hardware embodiment for performing the SIFT procedure is Bonato et al, “Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection,” IEEE Trans on Circuits and Systems for Video Tech, Vol. 18, No. 12, 2008. A block diagram of such arrangement 70 is provided in FIG. 18 (adapted from Bonato).

In addition to the camera 32, which produces the pixel data, there are three hardware modules 72-74. Module 72 receives pixels from the camera as input, and performs two types of operations: Gaussian filtering, and difference of Gaussians. The results of the former are sent to module 73; the results of the latter are sent to module 74. Module 73 computes pixel orientation and gradient magnitude. Module 74 detects keypoints and performs stability checks to ensure that the keypoints may be relied on as identifying features.

A software block 75 (executed on an Altera NIOS II field programmable gate array) generates a descriptor for each feature detected by block 74 based on the pixel orientation and gradient magnitude produced by block 73.

In addition to the different modules executing simultaneously, there is parallelism within each hardware block. Bonato's illustrative implementation processes 30 frames per second. A cell phone implementation may run somewhat more slowly, such as 10 fps, at least in the initial generation.

The reader is referred to the Bonato article for further details.

An alternative hardware architecture for executing SIFT techniques is detailed in Se et al, “Vision Based Modeling and Localization for Planetary Exploration Rovers,” Proc. of Int. Astronautical Congress (IAC), October, 2004.

Still another arrangement is detailed in Henze et al, “What is That? Object Recognition from Natural Features on a Mobile Phone,” Mobile Interaction with the Real World, Bonn, 2009. Henze et al use techniques by Nister et al, and Schindler et al, to expand the set of objects that may be recognized, through use of a tree approach (see, e.g., Nister et al, “Scalable Recognition with a Vocabulary Tree,” Proc. of Computer Vision and Pattern Recognition, 2006, and Schindler et al, “City-Scale Location Recognition,” Proc. of Computer Vision and Pattern Recognition, 2007.)

The foregoing implementations can be employed on cell phone platforms, or the processing can be distributed between a cell phone and one or more remote service providers (or it may be implemented with all image-processing performed off-phone).

Published patent application WO07/130688 concerns a cell phone-based implementation of SIFT, in which the local descriptor features are extracted by the cell phone processor, and transmitted to a remote database for matching against a reference library.

While SIFT is perhaps the most well known technique for generating robust local descriptors, there are others, which may be more or less suitable, depending on the application. These include GLOH (c.f., Mikolajczyk et al, “Performance Evaluation of Local Descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., Vol. 27, No. 10, pp. 1615-1630, 2005); and SURF (c.f., Bay et al, “SURF: Speeded Up Robust Features,” Eur. Conf. on Computer Vision (1), pp. 404-417, 2006); as well as Chen et al, “Efficient Extraction of Robust Image Features on Mobile Devices,” Proc. of the 6th IEEE and ACM Int. Symp. On Mixed and Augmented Reality, 2007; and Takacs et al, “Outdoors Augmented Reality on Mobile Phone Using Loxel-Based Visual Feature Organization,” ACM Int. Conf. on Multimedia Information Retrieval, October 2008. A survey of local descriptor features is provided in Mikolajczyk et al, “A Performance Evaluation of Local Descriptors,” IEEE Trans. On Pattern Analysis and Machine Intelligence, 2005.

The Takacs paper teaches that image matching speed is greatly increased by limiting the universe of reference images (from which matches are drawn), e.g., to those that are geographically close to the user's present position (e.g., within 30 meters). Applicants believe the universe can be advantageously limited, by user selection or otherwise, to specialized domains, such as faces, grocery products, houses, etc.
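Geographic limiting of the reference set can be sketched with a simple great-circle distance filter. This is not the loxel-based organization of the Takacs paper; it is a minimal illustration in which the record format (dictionaries with "lat" and "lon" keys), the function names and the spherical-Earth radius are assumptions.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_references(references, lat, lon, radius_m=30.0):
    """Keep only reference images captured within radius_m of the user."""
    return [ref for ref in references
            if haversine_m(lat, lon, ref["lat"], ref["lon"]) <= radius_m]


# Hypothetical reference records near a user at (45.5000, -122.6000).
references = [
    {"id": "storefront", "lat": 45.5000, "lon": -122.6000},
    {"id": "statue", "lat": 45.5002, "lon": -122.6000},      # about 22 m away
    {"id": "bridge", "lat": 45.5010, "lon": -122.6000},      # about 111 m away
]
candidates = nearby_references(references, 45.5000, -122.6000)
```

Only the descriptors of the surviving candidates would then be searched, which shrinks the matching problem before any image comparison is attempted.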

More on Audio Applications

A voice conversation on a mobile device naturally defines the construct of a session, providing a significant amount of metadata (mostly administrative information in the form of an identified caller, geographic location, etc.) that can be leveraged to prioritize audio keyvector processing.

If a call is received without accompanying CallerID information, this can trigger a process of voice pattern matching with past calls that are still in voicemail, or for which keyvector data has been preserved. (Google Voice is a long-term repository of potentially useful voice data for recognition or matching purposes.)

If the originating geography of a call can be identified but it is not a familiar number (e.g., it is not in the user's contacts list nor a commonly received number), functional blocks aimed at speech recognition can be invoked, taking into account the originating geography. For example, if it is a foreign country, speech recognition in the language of that country can be initiated. If the receiver accepts the call, simultaneous speech-to-text conversion in the native language of the user can be initiated and displayed on screen to aid in the conversation. If the geography is domestic, it may allow recall of regional dialect/accent-specific speech recognition libraries, to better cope with a southern drawl or Boston accent.
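The selection logic in the preceding paragraph might be sketched as follows. Everything named here is hypothetical: the lookup tables, the model identifiers, and the `choose_recognizer` function are illustrative assumptions, not part of any actual speech recognition product.

```python
# Hypothetical lookup tables; a real system would draw these from its own resources.
COUNTRY_LANGUAGE = {"FR": "fr", "DE": "de", "JP": "ja"}
REGIONAL_MODELS = {
    ("US", "MA"): "en-US-boston",
    ("US", "GA"): "en-US-southern",
}


def choose_recognizer(country, region=None, home_country="US", default="en-US"):
    """Pick a speech recognition model from the originating geography of a call.

    Foreign calls select the language of the originating country; domestic
    calls select a regional dialect/accent model when one is known.
    """
    if country != home_country:
        return COUNTRY_LANGUAGE.get(country, default)
    return REGIONAL_MODELS.get((country, region), default)


print(choose_recognizer("FR"))        # foreign call
print(choose_recognizer("US", "MA"))  # domestic call with a known regional model
```

A table-driven choice keeps the decision cheap enough to make while the phone is still ringing, before the call is accepted.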

Once a conversation has been initiated, prompts based on speech recognition can be provided on the cell phone screen (or another). If the speaker on the far end of the connection begins discussions on a particular topic, the local device can leverage resultant text to create natural language queries to reference sites such as Wikipedia, scour the local user's calendar to check for availability, transcribe shopping lists, etc.

Beyond evaluation and processing of speech during the session, other audio can be analyzed as well. If the user on the far end of the conversation cannot, or chooses not to, do local processing and keyvector creation, this can be accomplished on the local user's handset, allowing remote experiences to be shared locally.

It should be clear that all of the above holds true for video calls as well, where both audio and visual information can be parsed and processed into keyvectors.

Other Comments

Having described and illustrated the principles of our inventive work with reference to illustrative examples, it will be recognized that the technology is not so limited.

For example, while reference has been made to cell phones, it will be recognized that this technology finds utility with all manner of devices, both portable and fixed. PDAs, organizers, portable music players, desktop computers, laptop computers, tablet computers, netbooks, ultraportables, wearable computers, servers, etc., can all make use of the principles detailed herein. Particularly contemplated cell phones include the Apple iPhone, and cell phones following Google's Android specification (e.g., the G1 phone, manufactured for T-Mobile by HTC Corp.). The term “cell phone” should be construed to encompass all such devices, even those that are not strictly-speaking cellular, nor telephones.

(Details of the iPhone, including its touch interface, are provided in Apple's published patent application 20080174570.)

The design of cell phones and other computers referenced in this disclosure is familiar to the artisan. In general terms, each includes one or more processors, one or more memories (e.g., RAM), storage (e.g., a disk or flash memory), a user interface (which may include, e.g., a keypad, a TFT LCD or OLED display screen, touch or other gesture sensors, a camera or other optical sensor, a compass sensor, a 3D magnetometer, a 3-axis accelerometer, a microphone, etc., together with software instructions for providing a graphical user interface), interconnections between these elements (e.g., buses), and an interface for communicating with other devices (which may be wireless, such as GSM, CDMA, W-CDMA, CDMA2000, TDMA, EV-DO, HSDPA, WiFi, WiMax, or Bluetooth, and/or wired, such as through an Ethernet local area network, a T-1 internet connection, etc.).

The arrangements detailed in this specification can also be employed in portable monitoring devices such as Personal People Meters (PPMs), pager-sized devices that sense ambient media for audience survey purposes (see, e.g., Nielsen patent publication 20090070797, and Arbitron U.S. Pat. Nos. 6,871,180 and 7,222,071). The same principles can also be applied to different forms of content that may be provided to a user online. See, in this regard, Nielsen's patent application 20080320508, which details a network-connected media monitoring device.

While this specification earlier noted its relation to the assignee's previous patent filings, it bears repeating. These disclosures should be read in concert and construed as a whole. Applicants intend that features in each be combined with features in the others. Thus, for example, arrangements employing ThingPipe technology as detailed in application Ser. No. 12/498,709 may be implemented to also include features and arrangements detailed in the present application, and vice versa. Signal processing disclosed in application Ser. Nos. 12/271,772 and 12/490,980 can be implemented using the architectures and cloud arrangements detailed in the present specification, while the crowd-sourced databases, cover flow user interfaces, and other features detailed in the '772 and '980 applications can be incorporated in embodiments of the presently disclosed technologies. And so on. Thus, it should be understood that the methods, elements and concepts disclosed in the present application can be combined with the methods, elements and concepts detailed in those related applications. While some have been particularly detailed in the present specification, many have not, due to the large number of permutations and combinations. However, implementation of all such combinations is straightforward to the artisan from the provided teachings.

Elements and teachings within the different embodiments disclosed in the present specification are also meant to be exchanged and combined. For example, teachings detailed in the context of FIGS. 1-12 can be used in the arrangements of FIGS. 14-20, and vice versa.

The processes and system components detailed in this specification may be implemented as instructions for computing devices, including general purpose processor instructions for a variety of programmable processors, including microprocessors, graphics processing units (GPUs, such as the nVidia Tegra APX 2600), digital signal processors (e.g., the Texas Instruments TMS320 series devices), etc. These instructions may be implemented as software, firmware, etc. These instructions can also be implemented in various forms of processor circuitry, including programmable logic devices, FPGAs (e.g., the noted Xilinx Virtex series devices), FPOAs (e.g., the noted PicoChip devices), and application specific circuits, including digital, analog and mixed analog/digital circuitry. Execution of the instructions can be distributed among processors and/or made parallel across processors within a device or across a network of devices. Transformation of content signal data may also be distributed among different processor and memory devices. References to “processors” or “modules” (such as a Fourier transform processor, or an FFT module, etc.) should be understood to refer to functionality, rather than requiring a particular form of implementation.

References to FFTs should be understood to also include inverse FFTs, and related transforms (e.g., DFT, DCT, their respective inverses, etc.).

Software instructions for implementing the detailed functionality can be readily authored by artisans, from the descriptions provided herein, e.g., written in C, C++, Visual Basic, Java, Python, Tcl, Perl, Scheme, Ruby, etc. Cell phones and other devices according to the present technology can include software modules for performing the different functions and acts. Known artificial intelligence systems and techniques can be employed to make the inferences, conclusions, and other determinations noted above.

Commonly, each device includes operating system software that provides interfaces to hardware resources and general purpose functions, and also includes application software which can be selectively invoked to perform particular tasks desired by a user. Known browser software, communications software, and media processing software can be adapted for many of the uses detailed herein. Software and hardware configuration data/instructions are commonly stored as instructions in one or more data structures conveyed by tangible media, such as magnetic or optical discs, memory cards, ROM, etc., which may be accessed across a network. Some embodiments may be implemented as embedded systems: special purpose computer systems in which the operating system software and the application software are indistinguishable to the user (e.g., as is commonly the case in basic cell phones). The functionality detailed in this specification can be implemented in operating system software, application software and/or as embedded system software.

Different parts of the functionality can be implemented on different devices. For example, in a system in which a cell phone communicates with a server at a remote service provider, different tasks can be performed exclusively by one device or the other, or execution can be distributed between the devices. Extraction of eigenvalue data from imagery is but one example of such a task. Thus, it should be understood that description of an operation as being performed by a particular device (e.g., a cell phone) is not limiting but exemplary; performance of the operation by another device (e.g., a remote server), or shared between devices, is also expressly contemplated. (Moreover, more than two devices may commonly be employed. E.g., a service provider may refer some tasks, such as image search, object segmentation, and/or image classification, to servers dedicated to such tasks.)

(In like fashion, description of data being stored on a particular device is also exemplary; data can be stored anywhere: local device, remote device, in the cloud, distributed, etc.)

Operations need not be performed exclusively by specifically-identifiable hardware. Rather, some operations can be referred out to other services (e.g., cloud computing), which attend to their execution by still further, generally anonymous, systems. Such distributed systems can be large scale (e.g., involving computing resources around the globe), or local (e.g., as when a portable device identifies nearby devices through Bluetooth communication, and involves one or more of the nearby devices in a task, such as contributing data from a local geography; see in this regard U.S. Pat. No. 7,254,406 to Beros).

Similarly, while certain functions have been detailed as being performed by certain modules (e.g., control processor module 36, pipe manager 51, the query router and response manager of FIG. 7, etc.), in other implementations such functions can be performed by other modules, or by application software (or dispensed with altogether).

The reader will note that certain discussions contemplate arrangements in which most image processing is performed on the cell phone. External resources, in such arrangements, are used more as sources for data (e.g., Google) than for image processing tasks. Such arrangements can naturally be practiced using the principles discussed in other sections, in which some or all of the hardcore crunching of image-related data is referred out to external processors (service providers).

Likewise, while this disclosure has detailed particular ordering of acts and particular combinations of elements in the illustrative embodiments, it will be recognized that other contemplated methods may re-order acts (possibly omitting some and adding others), and other contemplated combinations may omit some elements and add others, etc.

Although disclosed as complete systems, sub-combinations of the detailed arrangements are also separately contemplated.

Reference was commonly made to the internet, in the illustrative embodiments. In other embodiments, other networks, including private networks of computers, can be employed also, or instead.

The reader will note that different names are sometimes used when referring to similar or identical components, processes, etc. This is due, in part, to the development of this patent specification over the course of nearly a year, with terminology that shifted over time. Thus, for example, a “visual query packet” and a “keyvector” can both refer to the same thing. Similarly with other terms.

In some modes, cell phones employing the present technology may be regarded as observational state machines.

While detailed primarily in the context of systems that perform image capture and processing, corresponding arrangements are equally applicable to systems that capture and process audio, or that capture and process both imagery and audio.

Some processing modules in an audio-based system may naturally be different. For example, audio processing commonly relies on critical band sampling (per the human auditory system). Cepstrum processing (a DCT of a power spectrum) is also frequently used.

An exemplary processing chain may include a band-pass filter to filter audio captured by a microphone in order to remove low and high frequencies, e.g., leaving the band 300-3000 Hz. A decimation stage may follow (reducing the sample rate, e.g., from 40K samples/second to 6K samples/second). An FFT can then follow. Power spectrum data can be computed by squaring the magnitudes of the output coefficients from the FFT (these may be grouped to effect critical band segmentation). Then a DCT may be performed, to yield cepstrum data. Some of these operations can be performed in the cloud. Outputs from any of these stages may be sent to the cloud for application processing, such as speech recognition, language translation, anonymization (returning the same vocalizations in a different voice), etc. Remote systems can also respond to commands spoken by a user and captured by a microphone, e.g., to control other systems, supply information for use by another process, etc.
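The later stages of such a chain (transform, power spectrum, band grouping, DCT) can be sketched in plain Python as below. Several simplifications are assumed and should not be read as the specification's own design: the input frame is taken to be already band-limited and decimated to 6,000 samples/second, so the band-pass filter and decimation stages are omitted; a naive DFT stands in for an FFT; the band edges are illustrative placeholders rather than true critical bands; and a logarithm is applied before the DCT, which is common practice in cepstral analysis but is not stated in the text above.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT of a real frame, bins 0..N/2 (a stand-in for an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def power_spectrum(frame):
    """Squared magnitude of each transform coefficient."""
    return [abs(c) ** 2 for c in dft(frame)]


def band_energies(power, sample_rate, n_samples, edges_hz):
    """Sum the power falling in each band [lo, hi) defined by consecutive edges."""
    energies = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        total = 0.0
        for k, p in enumerate(power):
            freq = k * sample_rate / n_samples  # centre frequency of bin k
            if lo <= freq < hi:
                total += p
        energies.append(total)
    return energies


def dct2(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


# Illustrative band edges in Hz (an assumption, not real critical-band edges).
BAND_EDGES_HZ = [300, 500, 800, 1200, 1800, 2400, 3000]


def cepstrum(frame, sample_rate=6000, edges_hz=BAND_EDGES_HZ):
    """Power spectrum -> band grouping -> log (assumed) -> DCT."""
    power = power_spectrum(frame)
    bands = band_energies(power, sample_rate, len(frame), edges_hz)
    log_bands = [math.log(e + 1e-12) for e in bands]  # small offset avoids log(0)
    return dct2(log_bands)


# A 1 kHz test tone, 120 samples at 6,000 samples/second.
tone = [math.sin(2 * math.pi * 1000 * t / 6000) for t in range(120)]
print(cepstrum(tone))
```

Any intermediate result here (the power spectrum, the band energies, or the cepstral coefficients) is compact compared with the raw audio, which is one reason a device might send such data to the cloud rather than the samples themselves.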

It will be recognized that the detailed processing of content signals (e.g., image signals, audio signals, etc.) includes the transformation of these signals in various physical forms. Images and video (forms of electromagnetic waves traveling through physical space and depicting physical objects) may be captured from physical objects using cameras or other capture equipment, or generated by a computing device. Similarly, audio pressure waves traveling through a physical medium may be captured using an audio transducer (e.g., microphone) and converted to an electronic signal (digital or analog form). While these signals are typically processed in electronic and digital form to implement the components and processes described above, they may also be captured, processed, transferred and stored in other physical forms, including electronic, optical, magnetic and electromagnetic wave forms. The content signals are transformed in various ways and for various purposes during processing, producing various data structure representations of the signals and related information. In turn, the data structure signals in memory are transformed for manipulation during searching, sorting, reading, writing and retrieval. The signals are also transformed for capture, transfer, storage, and output via display or audio transducer (e.g., speakers).

At the end of the provisional specification to which this application claims priority are listings of further references, detailing technologies and teachings that applicants intend be incorporated into the arrangements detailed herein (and into which applicants intend that the technologies and teachings detailed herein be incorporated).

In some embodiments, an appropriate response to captured imagery may be determined by reference to data stored in the device, without reference to any external resource. (The registry database used in many operating systems is one place where response-related data for certain inputs can be specified.) Alternatively, the information can be sent to a remote system, for it to determine the response.

The Figures not particularly identified above show aspects of illustrative embodiments or details of the disclosed technology.

The information sent from the device may be raw pixels, or an image in compressed form, or a transformed counterpart to an image, or features/metrics extracted from image data, etc. All may be regarded as image data. The receiving system can recognize the data type, or it can be expressly identified to the receiving system (e.g., bitmap, eigenvectors, Fourier-Mellin transform data, etc.), and that system can use the data type as one of the inputs in deciding how to process.

If the transmitted data is full image data (raw, or in a compressed form), then there will be essentially no duplication in packets received by the processing system, since essentially every picture is somewhat different. However, if the originating device performs processing on the full image to extract features or metrics, etc., then a receiving system may sometimes receive a packet identical to one it earlier encountered (or nearly so). In this case, the response for that “snap packet” (also termed a “pixel packet” or “keyvector”) may be recalled from a cache, rather than being determined anew. (The response info may be modified in accordance with user preference information, if available and applicable.)
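A minimal sketch of such a cache follows. The class name, the rounding precision, and the use of a SHA-256 digest as the cache key are assumptions made for illustration. Rounding each component before hashing is one simple way to let nearly identical keyvectors share an entry; it is imperfect, since two close values that straddle a rounding boundary will still produce different keys.

```python
import hashlib
import json


class ResponseCache:
    """Recall responses for keyvectors that have been seen before (or nearly so)."""

    def __init__(self, compute_response):
        self._compute = compute_response  # fallback used on a cache miss
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(keyvector, precision=2):
        # Quantize the components so near-duplicates map to the same digest.
        rounded = [round(float(x), precision) for x in keyvector]
        return hashlib.sha256(json.dumps(rounded).encode("utf-8")).hexdigest()

    def respond(self, keyvector):
        key = self._key(keyvector)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._compute(keyvector)
        return self._store[key]


def expensive_lookup(keyvector):
    """Hypothetical stand-in for determining a response anew."""
    return "response for %d-element keyvector" % len(keyvector)


cache = ResponseCache(expensive_lookup)
cache.respond([0.101, 0.5])
cache.respond([0.104, 0.5])  # rounds to the same key, so served from the cache
print(cache.hits)
```

Any per-user modification of the response, as noted in the parenthetical above, would be applied after the cache lookup so that the cached entry itself stays user-independent.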

In certain embodiments it may be desirable for a capture device to include some form of biometric authentication, such as a fingerprint reader integrated with the shutter button, to assure that a known user is operating the device.

Some embodiments can capture several images of a subject, from different perspectives (e.g., a video clip). Algorithms can then be applied to synthesize a 3D model of the imaged subject matter. From such a model new views of the subject may be derived: views that may be more suitable as stimuli to the detailed processes (e.g., avoiding an occluding foreground object).

In embodiments using textual descriptors, it is sometimes desirable to augment the descriptors with synonyms, hyponyms (more specific terms) and/or hypernyms (more general terms). These can be obtained from a variety of sources, including the WordNet database compiled by Princeton University.
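The augmentation step can be illustrated with a small sketch. To keep the example self-contained, a tiny hand-made lexicon stands in for WordNet; its entries, and the `augment` function itself, are hypothetical and are not drawn from the actual WordNet database or any WordNet API.

```python
# Hypothetical toy lexicon standing in for a resource such as WordNet.
TOY_LEXICON = {
    "dog": {
        "synonyms": ["canine"],
        "hyponyms": ["terrier", "beagle"],  # more specific terms
        "hypernyms": ["animal"],            # more general terms
    },
    "rose": {
        "synonyms": [],
        "hyponyms": ["tea rose"],
        "hypernyms": ["flower", "plant"],
    },
}


def augment(descriptors, lexicon=TOY_LEXICON,
            relations=("synonyms", "hyponyms", "hypernyms")):
    """Return the descriptors plus related terms, in order, without duplicates."""
    augmented = []
    seen = set()
    for term in descriptors:
        candidates = [term]
        entry = lexicon.get(term, {})
        for relation in relations:
            candidates.extend(entry.get(relation, []))
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                augmented.append(candidate)
    return augmented


print(augment(["dog", "rose"]))
```

Restricting `relations` to hypernyms broadens a query, while restricting it to hyponyms narrows it, so the choice can be matched to how many results the original descriptors return.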

Although many of the embodiments described above are in the context of a cell phone that submits image data to a service provider, triggering a corresponding response, the technology is more generally applicable, whenever processing of imagery or other content occurs.

The focus of this disclosure has been on imagery. But the techniques are useful with audio and video. The detailed technology is particularly useful with User Generated Content (UGC) sites, such as YouTube. Videos often are uploaded with little or no metadata. Various techniques are applied to identify them, with differing degrees of uncertainty (e.g., reading watermarks, calculating fingerprints, human reviewers, etc.), and this identification metadata is stored. Further metadata is accumulated based on profiles of users who view the video. Still further metadata can be harvested from later user comments posted about the video. (UGC-related arrangements in which applicants intend the present technology be included are detailed in published patent applications 20080208849 and 20080228733 (Digimarc), 20080165960 (TagStory), 20080162228 (Trivid), 20080178302 and 20080059211 (Attributor), 20080109369 (Google), 20080249961 (Nielsen), and 20080209502 (MovieLabs).) By arrangements like that detailed herein, appropriate ad/content pairings can be gleaned, and other enhancements to the users' experience can be offered.

Similarly, the technology can be used with audio captured by user devices, and recognition of captured speech. Information gleaned from any of the captured information (e.g., OCR'd text, decoded watermark data, recognized speech), can be used as metadata, for the purposes detailed herein.

Multi-media applications of this technology are also contemplated. For example, an image may be pattern-matched or GPS-matched to identify a set of similar images in Flickr. Metadata descriptors can be collected from that set of similar images, and used to query a database that includes audio and/or video. Thus, a user capturing and submitting an image of a trail marker on the Appalachian Trail (FIG. 38) may trigger download of the audio track from Aaron Copland's “Appalachian Spring” orchestral suite to the user's cell phone, or home entertainment system. (About sending content to different destinations that may be associated with a user see, e.g., patent publication 20070195987.)

Repeated reference was made to GPS data. This should be understood as a short-hand for any location-related information; it need not be derived from the Global Positioning System constellation of satellites. For example, another technology that is suitable for generating location data relies on radio signals that are commonly exchanged between devices (e.g., WiFi, cellular, etc.). Given several communicating devices, the signals themselves, and the imperfect digital clock signals that control them, form a reference system from which both highly accurate time and position can be abstracted. Such technology is detailed in laid-open international patent publication WO08/073347. The artisan will be familiar with several other location-estimating techniques, including those based on time of arrival techniques, and those based on locations of broadcast radio and television towers (as offered by Rosum) and WiFi nodes (as offered by Skyhook Wireless, and employed in the iPhone), etc.

While geolocation data commonly comprises latitude and longitude data, it may alternatively comprise more, less, or different data. For example, it may include orientation information, such as compass direction provided by a magnetometer, or inclination information provided by gyroscopic or other sensors. It may also include elevation information, such as provided by digital altimeter systems.

Reference was made to Apple's Bonjour software. Bonjour is Apple's implementation of Zeroconf, a service discovery protocol. Bonjour locates devices on a local network, and identifies services that each offers, using multicast Domain Name System service records. This software is built into the Apple Mac OS X operating system, and is also included in the Apple “Remote” application for the iPhone, where it is used to establish connections to iTunes libraries via WiFi. Bonjour services are implemented at the application level largely using standard TCP/IP calls, rather than in the operating system. Apple has made the source code of the Bonjour multicast DNS responder, the core component of service discovery, available as a Darwin open source project. The project provides source code to build the responder daemon for a wide range of platforms, including Mac OS X, Linux, *BSD, Solaris, and Windows. In addition, Apple provides a user-installable set of services called Bonjour for Windows, as well as Java libraries. Bonjour can be used in various applications of the present technology, involving communications between devices and systems.

(Other software can alternatively, or additionally, be used to exchange data between devices. Examples include Universal Plug and Play (UPnP) and its successor Devices Profile for Web Services (DPWS). These are other protocols implementing zero configuration networking services, through which devices can connect, identify themselves, advertise available capabilities to other devices, share content, etc.)

As noted earlier, artificial intelligence techniques can play an important role in embodiments of the present technology. A recent entrant into the field is the Wolfram Alpha product by Wolfram Research. Alpha computes answers and visualizations responsive to structured input, by reference to a knowledge base of curated data. Information gleaned from metadata analysis or semantic search engines, as detailed herein, can be presented to the Wolfram Alpha product to provide responsive information back to the user. In some embodiments, the user is involved in this submission of information, such as by structuring a query from terms and other primitives gleaned by the system, by selecting from among a menu of different queries composed by the system, etc. Additionally, or alternatively, responsive information from the Alpha system can be provided as input to other systems, such as Google, to identify further responsive information. Wolfram's patent publications 20080066052 and 20080250347 further detail aspects of the technology.

Another recent technical introduction is Google Voice (based on an earlier venture's GrandCentral product), which offers a number of improvements to traditional telephone systems. Such features can be used in conjunction with application of the present technology.

For example, the voice to text transcription services offered by Google Voice can be employed to capture ambient audio from the speaker's environment using the microphone in the user's cell phone, and generate corresponding digital data (e.g., ASCII information). The system can submit such data to services such as Google or Wolfram Alpha to obtain related information, which the system can then provide back to the user, either by a screen display, or by voice. Similarly, the speech recognition afforded by Google Voice can be used to provide a conversational user interface to cell phone devices, by which features of the technology detailed herein can be selectively invoked and controlled by spoken words.

In another aspect, when a user captures content (audio or visual) with a cell phone device, and a system employing the presently disclosed technology returns a response, the response information can be converted from text to speech, and delivered to the user's voicemail account in Google Voice. The user can access this data repository from any phone, or from any computer. The stored voice mail can be reviewed in its audible form, or the user can elect instead to review a textual counterpart, e.g., presented on a cell phone or computer screen.

(Aspects of the Google Voice technology are detailed in patent application 20080259918.)

More than a century of history has accustomed users to think of phones as communication devices that receive audio at point A, and deliver that audio to point B. However, the present technology can be employed with a much different effect. Audio-in, audio-out, may become a dated paradigm. In accordance with the present technology, phones are also communication devices that receive imagery (or other stimulus) at point A, leading to delivery of text, voice, data, imagery, video, smell, or other sensory experience at point B.

Instead of using the present technology as a query device, with a single phone serving as both the input and output device, a user of the present technology can direct that content responsive to the query be delivered to one or several destination systems, which may or may not include the originating phone. (The recipient(s) can be selected by known UI techniques, including keypad entry, scrolling through a menu of recipients, voice recognition, etc.)

A simple illustration of this usage model is a person who uses a cell phone to capture a picture of a rose plant in bloom. Responsive to the user's instruction, the picture, augmented by a synthesized smell of that particular variety of rose, is delivered to the user's girlfriend. (Arrangements for equipping computer devices to disperse programmable scents are known, e.g., the iSmell offering by Digiscents, and technologies detailed in patent documents 20080147515, 20080049960, 20060067859, WO00/15268 and WO00/15269.) Stimulus captured by one user at one location can lead to delivery of a different but related experiential stimulus to a different user at a different location.

In certain embodiments, a response to visual stimuli can include one or more graphical overlays presented on the cell phone screen, atop image data from the cell phone camera. The overlay can be geometrically registered with features in the image data, and be affine-distorted in correspondence with affine distortion of an object depicted in the image. Such technology is further detailed, e.g., in Digimarc's patent publication 20080300011. Such a graphic overlay can include menu features, with which a user can interact to perform desired functions. In addition, or alternatively, the overlay can include one or more graphical user interface controls. For example, several different objects may be recognized within the camera's field of view. Overlaid in association with each may be a graphic, which can be touched by the user to obtain information, or trigger a function, related to that respective object. The overlays may be regarded as visual baubles, drawing attention to the availability of information that may be accessed through user interaction with such graphic features, e.g., by a user tapping on that location of the screen, or circling that region with a finger or stylus, etc. As the user changes the camera's perspective, different baubles may appear, tracking the movement of different objects in the underlying, real-world imagery, and inviting the user to explore associated auxiliary information. Again, the overlays are desirably orthographically correct, with affine-correct projection onto associated real world features. (The pose estimation of subjects as imaged in the real world, from which appropriate spatial registration of overlays is determined, desirably is performed locally, but may be referred to the cloud depending on the application.)
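As a small illustration of affine-distorting an overlay to match an imaged object, the sketch below maps the corner points of an overlay through a 2x3 affine matrix. It assumes the matrix has already been obtained from pose estimation, which is not shown; the function names and the sample matrices are hypothetical.

```python
def apply_affine(matrix, point):
    """Map an (x, y) point through a 2x3 affine matrix ((a, b, tx), (c, d, ty))."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def register_overlay(corners, matrix):
    """Distort an overlay's corner points to follow the object's affine distortion."""
    return [apply_affine(matrix, corner) for corner in corners]


# Unit-square overlay, listed counter-clockwise from the origin.
UNIT_SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]

# Hypothetical pose: stretch 2x horizontally, then shift by (10, 5) pixels.
STRETCH_AND_SHIFT = ((2, 0, 10), (0, 1, 5))

# Hypothetical pose: horizontal shear, as when an object is viewed obliquely.
SHEAR = ((1, 0.5, 0), (0, 1, 0))

print(register_overlay(UNIT_SQUARE, STRETCH_AND_SHIFT))
print(register_overlay(UNIT_SQUARE, SHEAR))
```

Because an affine map preserves parallel lines, a rectangular bauble stays a parallelogram after distortion, so hit-testing a user's tap against it remains a simple point-in-parallelogram check.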

The objects can be recognized, and tracked, and feedback provided, by operations detailed above. For example, the local processor may perform object parsing and initial object recognition (e.g., inventorying proto-objects). Cloud processes may complete recognition operations, and serve up appropriate interactive portals that are orthographically registered onto the display scene (which registration may be performed by the local processor or by the cloud).

In some aspects, it will be recognized that the present technology acts as a graphical user interface, on a cell phone, to the real world.

In early implementations, general purpose visual query systems of the sort described will be relatively clunky, and not demonstrate much insight. However, by feeding a trickle (or torrent) of keyvector data back to the cloud for archiving and analysis (together with information about user action based on such data), those early systems can establish the data foundation from which templates and other training models can be built, enabling subsequent generations of such systems to be highly intuitive and responsive when presented with visual stimuli. (This trickle can be provided by a subroutine on the local device which occasionally grabs bits of information about how the user is working with the device, what works, what doesn't, what selections the user makes based on which stimuli, the stimuli involved, etc., and feeds same to the cloud.)

Reference was made to touchscreen interfaces, a form of gesture interface. Another form of gesture interface that can be used in embodiments of the present technology operates by sensing movement of a cell phone, by tracking movement of features within captured imagery. Further information on such gestural interfaces is detailed in Digimarc's U.S. Pat. No. 6,947,571.

Watermark decoding can be used in certain embodiments. Technology for encoding/decoding watermarks is detailed, e.g., in Digimarc's U.S. Pat. Nos. 6,614,914 and 6,122,403; in Nielsen's U.S. Pat. Nos. 6,968,564 and 7,006,555; and in Arbitron's U.S. Pat. Nos. 5,450,490, 5,764,763, 6,862,355, and 6,845,360.

Digimarc has various other patent filings relevant to the present subject matter. See, e.g., patent publications 20070156726, 20080049971, and 20070266252.

Google's book-scanning U.S. Pat. No. 7,508,978 details some principles useful in the present context. So does Google's patent application detailing its visions for interacting with next generation television: 20080271080.

Examples of audio fingerprinting are detailed in patent publications 20070250716, 20070174059 and 20080300011 (Digimarc), 20080276265, 20070274537 and 20050232411 (Nielsen), 20070124756 (Google), U.S. Pat. No. 7,516,074 (Auditude), and U.S. Pat. Nos. 6,990,453 and 7,359,889 (both Shazam). Examples of image/video fingerprinting are detailed in patent publications U.S. Pat. No. 7,020,304 (Digimarc), U.S. Pat. No. 7,486,827 (Seiko-Epson), 20070253594 (Vobile), 20080317278 (Thomson), and 20020044659 (NEC).

Although certain aspects of the detailed technology involve processing a large number of images to collect information, it will be recognized that related results can be obtained by having a large number of people (and/or automated processes) consider a single image (e.g., crowd-sourcing). Still greater information and utility can be achieved by combining these two general approaches.

The illustrations are meant to be exemplary and not limiting. For example, they sometimes show multiple databases, when a single database can be used (and vice-versa). Likewise, some links between the depicted blocks are not shown, for clarity's sake.

Contextual data can be used throughout the detailed embodiments to further enhance operation. For example, a process may depend on whether the originating device is a cell phone or a desktop computer; whether the ambient temperature is 30 or 80; the location of, and other information characterizing the user; etc.
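By way of illustration only, context-dependent processing of this sort might be sketched as below. The `Context` fields mirror the examples just given (device type, ambient temperature, user location); the step names, the `plan_operations` function, and the temperature threshold are hypothetical assumptions introduced for this sketch, and the temperature is treated as a bare number because the text does not state its unit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Context:
    """Contextual data accompanying a query (fields mirror the examples in the text)."""
    device_type: str                 # "cell_phone" or "desktop"
    ambient_temperature: float       # unit unspecified in the text; treated as a bare number
    location: Optional[str] = None   # user location, if known


def plan_operations(ctx: Context, temperature_threshold: float = 55.0) -> List[str]:
    """Return an ordered list of (hypothetical) processing steps chosen from context.

    The threshold of 55 is an assumed midpoint between the text's 30 and 80.
    """
    steps = ["capture"]
    if ctx.device_type == "cell_phone":
        # Assumed rule: a phone reduces data volume before further processing.
        steps.append("downsample")
    steps.append("extract_keyvectors")
    # Assumed rule: temperature biases which candidate results are favored.
    if ctx.ambient_temperature < temperature_threshold:
        steps.append("bias_cold_weather")
    else:
        steps.append("bias_warm_weather")
    if ctx.location is not None:
        # Assumed rule: known location narrows the candidate set.
        steps.append("filter_by_location")
    steps.append("recognize")
    return steps
```

The same stimulus thus yields a different plan on a phone at a temperature of 30 than on a desktop at 80, which is the sense in which a process "may depend on" such context.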

While the detailed embodiments often present candidate results/actions as a series of cached displays on the cell phone screen, between which the user can rapidly switch, in other embodiments this need not be the case. A more traditional single-screen presentation, giving a menu of results, can be used, and the user can press a keypad digit, or highlight a desired option, to make a selection. Or bandwidth may increase sufficiently that the same user experience can be provided without locally caching or buffering data, but rather having it delivered to the cell phone as needed.

Geographically-based database methods are detailed, e.g., in Digimarc's patent publication 20030110185. Other arrangements for navigating through image collections, and performing search, are shown in patent publications 20080010276 (Executive Development Corp.) and 20060195475, 20070110338, 20080027985, 20080028341 (Microsoft's Photosynth work).

It is impossible to expressly catalog the myriad variations and combinations of the technology described herein. Applicants recognize and intend that the concepts of this specification can be combined, substituted and interchanged, both among and between themselves, as well as with those known from the cited prior art. Moreover, it will be recognized that the detailed technology can be included with other technologies, current and upcoming, to advantageous effect.

To provide a comprehensive disclosure without unduly lengthening this specification, applicants incorporate-by-reference the documents and patent disclosures referenced above. (Such documents are incorporated in their entireties, even if cited above in connection with specific of their teachings.) These references disclose technologies and teachings that can be incorporated into the arrangements detailed herein, and into which the technologies and teachings detailed herein can be incorporated.

I claim:
 1. In a distributed processing method that includes performing an operation on stimuli captured by a camera or microphone sensor of a user's mobile device, by using a combined system that includes both processing hardware in the mobile device and processing hardware remote from the mobile device, an improvement wherein: the operation comprises an image or audio recognition operation, and the method includes: identifying, using a hardware processor, a set of component operations that should be executed by said combined system to perform said recognition operation; and determining a sequence in which said component operations should be performed based on one or more circumstance or context factors selected from the group consisting of: (a) information about mobile device power availability or usage; (b) information about a needed function response time; (c) information about a routing constraint; (d) information about a state of hardware resources within the mobile device; (e) information about mobile device connectivity; (f) information about a geographical consideration; (g) information about a pipeline stall risk; (h) information about turnaround time or cost associated with the remote processor; and (i) information about a user preference regarding remote processing; and wherein, at a first time, said component operations are performed in a first sequence, and at a second time, said component operations are performed in a second, different, sequence, due to a difference in one or more of said factors between the first and second times.
 2. The method of claim 1 in which said act of identifying a set of component operations is also based on one or more circumstance or context factors selected from said list, wherein at one time, a first set of component operations is identified to perform said recognition operation, and at another time, a second, different, set of component functions is identified to perform said recognition operation, due to a difference in one or more of said factors between said one and another times.
 3. The method of claim 1 in which said act of determining a sequence is based on two or more of said factors.
 4. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a needed function response time.
 5. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a routing constraint.
 6. The method of claim 5 in which the routing constraint is imposed by a provider of a local wireless network.
 7. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a state of hardware resources within the processing device.
 8. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about mobile device connectivity.
 9. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a geographical consideration.
 10. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a pipeline stall risk.
 11. The method of claim 10 that includes assessing the pipeline stall risk by reference to historical patterns, or based on information that completion of an operation requires further data of uncertain availability.
 12. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about turnaround time or cost associated with the remote processor.
 13. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about remote processor readiness, or information about remote processor cost.
 14. The method of claim 1 in which said act of determining a sequence is based on one or more factors including information about a user preference regarding remote processing.
 15. The method of claim 14 wherein the user preference comprises user preference about location of a remote service provider.
 16. The method of claim 1 that includes determining that one component operation should be performed before another component operation, and as a consequence, performing said one component operation using processing hardware in the mobile device, and performing said another component operation using processing hardware remote from the mobile device.
 17. The method of claim 1 that includes determining that one component operation should be performed before another component operation, and as a consequence, performing said one component operation using processing hardware remote from the mobile device, and performing said another component operation using processing hardware in the mobile device.
 18. A mobile device comprising at least one processor, memory, camera, and microphone, the memory containing software instructions that configure the device to perform an image or audio recognition operation in conjunction with a cooperating remote processing device, the recognition operation comprising plural component operations, one or more of which are performed by said at least one processor of the mobile device, and one or more of which are performed by the cooperating remote processing device, wherein said instructions in the mobile device memory include instructions for determining a sequence in which said component operations should be performed, based on one or more circumstance or context factors selected from the group consisting of: (a) information about mobile device power availability or usage; (b) information about a needed function response time; (c) information about a routing constraint; (d) information about a state of hardware resources within the mobile device; (e) information about mobile device connectivity; (f) information about a geographical consideration; (g) information about a pipeline stall risk; (h) information about turnaround time or cost associated with the remote device; and (i) information about a user preference regarding remote processing; and wherein, at a first time, said component operations are performed in a first sequence, and at a second time, said component operations are performed in a second, different, sequence, due to a difference in one or more of said factors between the first and second times.